* Possible bug: detached mounts difficult to cleanup
@ 2017-01-11  1:24 Krister Johansen
  2017-01-11  2:27 ` Eric W. Biederman
       [not found] ` <20170111012454.GB2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
  0 siblings, 2 replies; 22+ messages in thread
From: Krister Johansen @ 2017-01-11  1:24 UTC (permalink / raw)
  To: Eric W. Biederman, Al Viro; +Cc: linux-fsdevel, containers

Gents,
This is the follow-up e-mail I referenced in our discussion about the
put_mountpoint locking problem.

The problem manifested itself as a situation where our container
provisioner would sometimes fail to restart a container to which it had
made configuration changes.  The IP address chosen by the provisioner
was still in use in another container.  This meant that the system had a
network namespace with an IP address that was still in use, despite the
provisioner having torn down the container as part of the reconfig
operation.

In order to keep the network namespace in use while the container is
alive, the software bind mounts the net and user namespaces out of
/proc/<pid>/ns/ into a directory that's used as the top level for the
container instance.
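
That is roughly the same trick iproute2's "ip netns add" uses; a minimal
sketch of the idea, with made-up paths and $PID standing in for a process
inside the container:

    # mkdir -p /run/container/ns
    # touch /run/container/ns/net
    # mount --bind /proc/$PID/ns/net /run/container/ns/net

As long as that bind mount exists somewhere, the network namespace stays
alive even after every process inside it has exited.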

After forcing a crash dump and looking through the results, I was able
to confirm that the only reference keeping the net namespace alive was
the one held by the dentry on the mountpoint for the nsfs mount of the
network namespace.  The problem was that the container software had
unmounted this mountpoint, so it wasn't even in the host container's
mount namespace.

Since the software was using shared mounts, the nsfs bind mount was
getting copied into the mount namespaces of any container that was
created after the nsfs bind mount was established.  However, this isn't
obvious, because each new namespace executes a pivot_root(2) followed
immediately by a umount2(MNT_DETACH) of the old part of the root
filesystem that is no longer in use.  These mounts of the nsfs bind
mount weren't visible in the kernel debugger, because they'd been
detached from the mount namespace's mount tree.
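
The propagation setting of the parent mount can be checked from the host
with findmnt, e.g. (assuming the bind mounts live under /run):

    $ findmnt -o TARGET,PROPAGATION /run

The PROPAGATION column shows whether that mount is shared, private, or a
slave.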

After looking at how iproute handles net namespaces, I ran a test where
every unmount of the net nsfs bind mount was followed by a rm of that
mountpoint.  That always resulted in the mountpoint getting freed and
the refcount on the dentry going to zero.  It was enough for me to make
forward progress on the other tasks at hand.  I was able to verify that
the nsfs refcount was getting dropped, and we were going through the
__detach_mounts() cleanup path:

rm 14633 [013] 29947.047071:         probe:nsfs_evict: (ffffffff81254fb0)
            7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
            7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
            7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
            7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
            7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
            7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
            7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
            7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
            7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
            7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
            7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
            7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
            7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
            7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
            7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
            7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
            7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
                   e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)

Over the holiday, I had some more time to debug this and was able to
narrow it down to the following case.

1. The mount namespace that gets a copy of the nsfs bind mount must be
created in a different user namespace than the host container.  This
causes MNT_LOCKED to get set on the cloned mounts.

2. In the container, pivot_root(2) and then umount2(MNT_DETACH) the old
part of the tree from pivot_root.  Ensure that the nsfs mount is beneath
the root of this tree.

3. Umount the nsfs mount in the host container.  If the mount wasn't
locked in the other container, you'll see a kprobe on nsfs_evict trigger
immediately.  If it was MNT_LOCKED, then you'll need to rm the
mountpoint in the host to trigger the nsfs_evict.
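
The kprobe mentioned above can be placed with perf, assuming nsfs_evict
is visible in kallsyms; roughly:

    # perf probe --add nsfs_evict
    # perf record -e probe:nsfs_evict -ag
      ... reproduce in another terminal, then Ctrl-C ...
    # perf script

perf script then shows the kernel stack at each nsfs_evict hit, like the
trace quoted above.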

For an nsfs mount, it's not particularly problematic to have to rm the
mount to clean it up, but the other mounts in the tree that are detached
and locked are often on mountpoints that can't be easily rm'd from the
host.  These are harder to clean up, and essentially orphaned until the
container's mount ns goes away.

It would be ideal if we could release these mounts sooner, but I'm
unsure of the best approach here.

Debugging further, I was able to see that:

a) The reason the nsfs mount isn't considered as part of propagate_mount_unlock
is that the 'mnt' passed to that function is the top of the mount tree,
and it appears to only consider mounts directly related to 'mnt'.

b) The change_mnt_propagation(MS_PRIVATE) at the end of the while loop
in umount_tree() is what ends up hiding these mounts from the host
container.  Once they're no longer slaved or shared, we never again
consider them as candidates for unlocking.

c) Also note that these detached mounts that aren't freed are no longer
charged against a container's ns->mounts limit, so it may be possible
for a mount ns to be using more mounts than it has officially accounted
for.
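
(If that limit is the fs.mount-max accounting, which I believe it is, the
cap is visible as a sysctl:

    $ sysctl fs.mount-max
    fs.mount-max = 100000
)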

I wondered if a naive solution could re-walk the list of mounts
processed in umount_tree() and, if all of the detached but locked mounts
had a refcount that indicated they're unused, unlock and unmount them.
At least in the case of the containers I'm dealing with, the container
software should be ensuring that nothing in the container has a
reference on anything that's under the detached portion of the tree.
However, there's probably a better way to do this.

Thoughts?

-K

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
@ 2017-01-11  2:04     ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-01-11  2:04 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Al Viro, linux-fsdevel, containers

Krister Johansen <kjlx@templeofstupid.com> writes:

> Gents,
> This is the follow-up e-mail I referenced in our discussion about the
> put_mountpoint locking problem.
>
> The problem manifested itself as a situation where our container
> provisioner would sometimes fail to re-start a container that it had
> made configuration changes.  The IP address chosen by the provisioner
> was still in use in another container.  This meant that the system had a
> network namespace with an IP address that was still in use, despite the
> provisioner having torn down the container as part of the reconfig
> operation.
>
> In order to keep the network namespace in use while the container is
> alive, the software bind mounts the net and user namespaces out of
> /proc/<pid>/ns/ into a directory that's used as the top level for the
> container instance.
>
> After forcing a crash dump and looking through the results, I was able
> to confirm that the only reference keeping the net namespace alive was
> the one held by the dentry on the mountpoint for the nsfs mount of the
> network namespace.  The problem was that the container software had
> unmounted this mountpoint, so it wasn't even in the host container's
> mount namespace.
>
> Since the software was using shared mounts, the nsfs bind mount was
> getting copied into the mount namespaces of any container that was
> created after the nsfs bind mount was established.  However, this isn't
> obvious because each new namespace executes a pivot_root(2), followed by
> an immediate and subsequent umount2(MNT_DETACH) on the old part of the
> root filesystem that is no longer in use.  These mounts of the nsfs bind
> mount weren't visible in the kernel debugger, because they'd been
> detached from the mount namespace's mount tree.
>
> After looking at how iproute handles net namespaces, I ran a test where
> every unmount of the net nsfs bind mount was followed by a rm of that
> mountpoint.  That always resulted in the mountpoint getting freed and
> the refcount on the dentry going to zero.  It was enough for me to make
> forward progress on the other tasks at hand.  I was able to verify that
> the nsfs refcount was getting dropped, and we were going through the
> __detach_mounts() cleanup path:
>
> rm 14633 [013] 29947.047071:         probe:nsfs_evict: (ffffffff81254fb0)
>             7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
>             7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
>             7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
>             7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
>             7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
>             7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
>             7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
>             7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
>             7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
>             7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
>             7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
>             7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
>             7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
>             7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
>             7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
>             7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
>             7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
>                    e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
>
> Over the holiday, I had some more time to debug this and was able to
> narrow it down to the following case.
>
> 1. The mount namespace that gets a copy of the nsfs bind mount must be
> created in a different user namespace than the host container.  This
> causes MNT_LOCKED to get set on the cloned mounts.
>
> 2. In the container, pivot_root(2) and then umount2(MNT_DETACH) the old
> part of the tree from pivot_root.  Ensure that the nsfs mount is beneath
> the root of this tree.
>
> 3. Umount the nsfs mount in the host container.  If the mount wasn't
> locked in the other container, you'll see a kprobe on nsfs_evict trigger
> immediately.  If it was MNT_LOCKED, then you'll need to rm the
> mountpoint in the host to trigger the nsfs_evict.
>
> For a nsfs mount, it's not particularly problematic to have to rm the
> mount to clean it up, but the other mounts in the tree that are detached
> and locked are often on mountpoints that can't be easily rm'd from the
> host.  These are harder to clean up, and essentially orphaned until the
> container's mount ns goes away.
>
> It would be ideal if we could release these mounts sooner, but I'm
> unsure of the best approach here.
>
> Debugging further, I was able to see that:
>
> a) The reason the nsfs isn't considered as part of propagate_mount_unlock
> is that the 'mnt' passed to that function is the top of the mount tree
> and it appears to only be considering mounts directly related to 'mnt'.
>
> b) The change_mnt_propagation(MS_PRIVATE) at the end of the while loop
> in umount_tree() is what ends up hiding these mounts from the host
> container.  Once they're no longer slaved or shared, we never again
> consider them as candidates for unlocking.
>
> c) Also note that these detached mounts that aren't free'd aren't
> charged against a container's ns->mounts limit, so it may be possible
> for a mount ns to be using more mounts than it has officially accounted
> for.
>
> I wondered if a naive solution could re-walk the list of mounts
> processed in umount_tree() and if all of the detached but locked mounts
> had a refcount that indicated they're unused, they could be unlocked and
> unmounted.  At least in the case of the containers I'm dealing with,
> the container software should be ensuring that nothing in the container
> has a reference on anything that's under the detached portion of the
> tree.  However, there's probably a better way to do this.
>
> Thoughts?

Any chance you have a trivial reproducer script?

From your description I don't quite see the problem.  I know where to
look, but if you could give a script that reproduces the conditions you
see, that would make it easier for me to dig into, and would certainly
remove ambiguity.  Ideally such a script would be runnable
under unshare -Urm for easy repeated testing.
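
For instance, something as simple as this hypothetical invocation, which
runs the script in fresh user and mount namespaces with the caller mapped
to root:

    $ unshare -Urm ./reproducer.sh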

Eric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
  2017-01-11  1:24 Possible bug: detached mounts difficult to cleanup Krister Johansen
@ 2017-01-11  2:27 ` Eric W. Biederman
  2017-01-11  2:51   ` Al Viro
       [not found]   ` <87fukqwcue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
       [not found] ` <20170111012454.GB2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
  1 sibling, 2 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-01-11  2:27 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Al Viro, linux-fsdevel, containers

Krister Johansen <kjlx@templeofstupid.com> writes:

> Gents,
>
> I wondered if a naive solution could re-walk the list of mounts
> processed in umount_tree() and if all of the detached but locked mounts
> had a refcount that indicated they're unused, they could be unlocked and
> unmounted.  At least in the case of the containers I'm dealing with,
> the container software should be ensuring that nothing in the container
> has a reference on anything that's under the detached portion of the
> tree.  However, there's probably a better way to do this.

So if the code is working correctly that should already happen.

The design is for the parent mount to hold a reference to the submounts.
When the reference count on the parent drops to 0, the references on
all of the submounts will also be dropped.

I was hoping to read the code and point it out to you quickly, but I am
not seeing it now.  I am wondering if in all of the refactoring of that
code something was dropped/missed :(

Somewhere there is supposed to be the equivalent of:
	pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
when we unhash those mounts because the last count has gone away.
Either it is very sophisticated or I am missing it.  Grr....

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
@ 2017-01-11  2:37       ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-01-11  2:37 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Al Viro, linux-fsdevel, containers

ebiederm@xmission.com (Eric W. Biederman) writes:

> Krister Johansen <kjlx@templeofstupid.com> writes:
>
>> Gents,
>>
>> I wondered if a naive solution could re-walk the list of mounts
>> processed in umount_tree() and if all of the detached but locked mounts
>> had a refcount that indicated they're unused, they could be unlocked and
>> unmounted.  At least in the case of the containers I'm dealing with, the
>> the container software should be ensuring that nothing in the container
>> has a reference on anything that's under the detached portion of the
>> tree.  However, there's probably a better way to do this.
>
> So if the code is working correctly that should already happen.
>
> The design is for the parent mount to hold a reference to the submounts.
> And when the reference on the parent drops to 0.  The references on
> all of the submounts will also be dropped.
>
> I was hoping to read the code and point it out to you quickly, but I am
> not seeing it now.  I am wondering if in all of the refactoring of that
> code something was dropped/missed :(
>
> Somewhere there is supposed to be the equivalent of:
> 	pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
> when we unhash those mounts because the last count has gone away.
> Either it is very sophisticated or I am missing it.  Grr....

Ok.  I see the code now, and it should be doing the right thing.

During umount_tree the code calls pin_insert_group(...) with the
last parameter being NULL.  That adds the mount to one or two
lists: the mnt_pins list of the parent mount and the &unmounted
hlist.

Then later, when the parent's cleanup_mnt is called, if mnt_pins
still has entries mnt_pin_kill is called.  For every mount on the
mnt_pins list drop_mountpoint is called, which calls dput and
mntput.

So that is how your references are supposed to be freed, which leaves
the question: why aren't your mounts being freed?  Is a file descriptor,
perhaps from an mmapped executable, holding a mount reference?

Perhaps you want to manually unmount everything and see if unmount will
fail on some mount and let you see which mount has a reference to it.
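
A crude way to do that, sketched here with <container-pid> standing in
for whatever pid is still in that mount namespace, is to walk the mounts
deepest-first and see which umount reports EBUSY:

    # nsenter -t <container-pid> -m sh
    # tac /proc/self/mounts | awk '{print $2}' | while read m; do
          umount -v "$m" || echo "busy: $m"
      done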

Eric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
  2017-01-11  2:27 ` Eric W. Biederman
@ 2017-01-11  2:51   ` Al Viro
       [not found]   ` <87fukqwcue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 0 replies; 22+ messages in thread
From: Al Viro @ 2017-01-11  2:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Krister Johansen, linux-fsdevel, containers

On Wed, Jan 11, 2017 at 03:27:05PM +1300, Eric W. Biederman wrote:

> The design is for the parent mount to hold a reference to the submounts.
> And when the reference on the parent drops to 0.  The references on
> all of the submounts will also be dropped.

Parent does _not_ hold any references to submounts.  Never had.  What
happens in umount_tree() is that any surviving submounts
	a) are inserted into ->mnt_pins of parent, but not into unmounted.
	b) have the reference to parent dropped.

When the last reference to parent gets dropped, all remaining submounts
get unhashed and when we get to cleanup_mnt() on parent, ->mnt_pins
gets pulled.  Which does dput() on (ex)mountpoints and mntput() on
those submounts.

> I was hoping to read the code and point it out to you quickly, but I am
> not seeing it now.  I am wondering if in all of the refactoring of that
> code something was dropped/missed :(
> 
> Somewhere there is supposed to be the equivalent of:
> 	pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
> when we unhash those mounts because the last count has gone away.
> Either it is very sophisticated or I am missing it.  Grr....

What you are missing is that they never end up on 'unmounted' - only on
->mnt_pins of parent.  It's the same pin_insert_group() in umount_tree()
as for everything else, the only difference being that it does _not_
get kicked out at namespace_unlock().

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
@ 2017-01-11  3:07         ` Krister Johansen
  0 siblings, 0 replies; 22+ messages in thread
From: Krister Johansen @ 2017-01-11  3:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Krister Johansen, Al Viro, linux-fsdevel, containers

On Wed, Jan 11, 2017 at 03:04:22PM +1300, Eric W. Biederman wrote:
> Any chance you have a trivial reproducer script?
> 
> From your description I don't quite see the problem.  I know where to
> look, but if you could give a script that reproduces the conditions you
> see, that would make it easier for me to dig into, and would certainly
> remove ambiguity.  Ideally such a script would be runnable
> under unshare -Urm for easy repeated testing.

My apologies.  I don't have something that fits into a shell script, but
I can walk you through the simplest test case that I used when I was
debugging this.

Create a net ns:

    $ sudo unshare -n bash
    # echo $$
    2771

In another terminal, bind mount that ns onto a file:

    # mkdir /run/testns
    # touch /run/testns/ns1
    # mount --bind /proc/2771/ns/net /run/testns/ns1

Back in the first terminal, create new namespaces, pivot_root, and detach-unmount the old root:

    # exit
    $ unshare -U -m -n --propagation slave --map-root-user bash
    # mkdir binddir
    # mount --bind binddir binddir
    # cp busybox binddir
    # mkdir binddir/old_root
    # cd binddir
    # pivot_root . old_root
    # ./busybox umount -l old_root

Back in the second terminal:

    # umount /run/testns/ns1
[ watch for ns cleanup -- not seen if mnt is locked ]
    # rm /run/testns/ns1
[ now we see it ]


For the observability stuff, I went back and forth between using 'perf
probe' to place a kprobe on nsfs_evict, and using a bcc script to
watch events on the same kprobe.  I can send along the script, if you're
a bcc user.
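
If you just want a quick look without a custom script, bcc's generic
trace tool does roughly the same thing (install path varies by distro;
assumes kprobes are allowed):

    # /usr/share/bcc/tools/trace -K 'nsfs_evict'

That attaches a kprobe to nsfs_evict and prints the kernel stack on each
hit.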

At least when I debugged this, I found that when the mount was
MNT_LOCKED, disconnect_mount() returned false so the actual unmount
didn't happen until the mountpoint was rm'd in the host container.

I'm not sure if this is actually a bug, or a case where the cleanup is
just conservative.  However, it looked like in the case where we call
pivot_root, the detached mounts get marked private but otherwise aren't
in use in the container's namespace any longer.

-K

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
@ 2017-01-12  6:15           ` Krister Johansen
  0 siblings, 0 replies; 22+ messages in thread
From: Krister Johansen @ 2017-01-12  6:15 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Krister Johansen, Al Viro, linux-fsdevel, containers

On Wed, Jan 11, 2017 at 03:37:36PM +1300, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> > So if the code is working correctly that should already happen.
> >
> > The design is for the parent mount to hold a reference to the submounts.
> > And when the reference on the parent drops to 0.  The references on
> > all of the submounts will also be dropped.
> >
> > I was hoping to read the code and point it out to you quickly, but I am
> > not seeing it now.  I am wondering if in all of the refactoring of that
> > code something was dropped/missed :(
> >
> > Somewhere there is supposed to be the equivalent of:
> > 	pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
> > when we unhash those mounts because the last count has gone away.
> > Either it is very sophisticated or I am missing it.  Grr....
> 
> Ok.  I see the code now, and it should be doing the right thing.
> 
> During umount_tree the code calls pin_insert_group(...) with the
>> last parameter being NULL.  That adds the mount to one or two
> lists.  The mnt_pins list of the parent mount and the &unmounted
> hlist.
> 
> Then later when the parent's cleanup_mnt is called if the mnt_pins
> still has entries mnt_pin_kill is called.  For every mount on the
> mnt_pins list drop_mountpoint is called.  Which calls dput and
> mntput.
> 
> So that is how your references are supposed to be freed.  Which leaves
> the question why aren't your mounts being freed?  Is a file descriptor
> perhaps from a mmaped executable holding a mount reference?

Was that test case of any use?  I'm afraid that I'm still failing to
communicate the problem.  The parent's cleanup_mnt isn't getting called
for the detached and locked mounts, and I can explain why.  The only
time I'm seeing them freed is via the __detach_mounts() path, which is
only invoked for d_invalidate, vfs_rmdir, vfs_unlink, and vfs_rename:

rm 14633 [013] 29947.047071:         probe:nsfs_evict: (ffffffff81254fb0)
            7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
            7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
            7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
            7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
            7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
            7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
            7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
            7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
            7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
            7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
            7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
            7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
            7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
            7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
            7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
            7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
            7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
                   e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)

So that's the stack where I see it work, but I never see it go through
the cleanup_mnt() path, and here's why.  First, the code for the loop
in umount_tree():

        while (!list_empty(&tmp_list)) {
                struct mnt_namespace *ns;
                bool disconnect;
                p = list_first_entry(&tmp_list, struct mount, mnt_list);
                list_del_init(&p->mnt_expire);
                list_del_init(&p->mnt_list);
                ns = p->mnt_ns;
                if (ns) {
                        ns->mounts--;
                        __touch_mnt_namespace(ns);
                }
                p->mnt_ns = NULL;
                if (how & UMOUNT_SYNC)
                        p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
                        
  #1 --->       disconnect = disconnect_mount(p, how);

  #2 --->       pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
                                 disconnect ? &unmounted : NULL);
                if (mnt_has_parent(p)) {
                        mnt_add_count(p->mnt_parent, -1);
                        if (!disconnect) {
                                /* Don't forget about p */
                                list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
                        } else {
                                umount_mnt(p);
                        }       
                }
  #3 --->       change_mnt_propagation(p, MS_PRIVATE);
        }


So at #1 disconnect is false if p has MNT_LOCKED set.
At #2 p isn't added to the s_list on 'unmounted' if disconnect is false.

The mount gets hidden from the host container at #3, but that's not
germane to the invocation of pin_kill.

This is namespace_unlock:

        hlist_move_list(&unmounted, &head);

        up_write(&namespace_sem);

        if (likely(hlist_empty(&head)))
                return;

        synchronize_rcu();

        group_pin_kill(&head);

So unmounted is moved to head, and group_pin_kill is invoked on that.
Only the mounts we marked for disconnect go through the cleanup_mnt path
that way.

So that's the fundamental question I'm trying to ask.  If we have a
mount tree that's umount2(MNT_DETACH)'d immediately after a pivot_root,
and those mounts never get cleaned up except when their mountpoints
get rm'd or mv'd, is there a better way to clean up this tree?

-K


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
  2017-01-12  6:15           ` Krister Johansen
@ 2017-01-12  8:26               ` Eric W. Biederman
  -1 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-01-12  8:26 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Al Viro, linux-fsdevel, containers

Krister Johansen <kjlx@templeofstupid.com> writes:

> On Wed, Jan 11, 2017 at 03:37:36PM +1300, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> > So if the code is working correctly that should already happen.
>> >
>> > The design is for the parent mount to hold a reference to the submounts.
>> > And when the reference on the parent drops to 0.  The references on
>> > all of the submounts will also be dropped.
>> >
>> > I was hoping to read the code and point it out to you quickly, but I am
>> > not seeing it now.  I am wondering if in all of the refactoring of that
>> > code something was dropped/missed :(
>> >
>> > Somewhere there is supposed to be the equivalent of:
>> > 	pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
>> > when we unhash those mounts because the last count has gone away.
>> > Either it is very sophisticated or I am missing it.  Grr....
>> 
>> Ok.  I see the code now, and it should be doing the right thing.
>> 
>> During umount_tree the code calls pin_insert_group(...) with the
>> last parameter being NULL.  That adds the mount to one or two
>> lists.  The mnt_pins list of the parent mount and the &unmounted
>> hlist.
>> 
>> Then later when the parent's cleanup_mnt is called if the mnt_pins
>> still has entries mnt_pin_kill is called.  For every mount on the
>> mnt_pins list drop_mountpoint is called.  Which calls dput and
>> mntput.
>> 
>> So that is how your references are supposed to be freed.  Which leaves
>> the question why aren't your mounts being freed?  Is a file descriptor
>> perhaps from a mmaped executable holding a mount reference?
>
> Was that test case of any use?  I'm afraid that I'm still failing to
> communicate the problem.

I apologize; I really haven't had the energy to dig into it, especially
since when I read the code the only way I could see to get the
problem you are having is for something to be retaining a reference to
the mounts.

> The parent's cleanup_mnt isn't getting called
> for the detached and locked mounts, and I can explain why.  The only
> time I'm seeing them freed is via the __detach_mounts() path, which is
> only invoked for d_invalidate, vfs_rmdir, vfs_unlink, and vfs_rename:
>
> rm 14633 [013] 29947.047071:         probe:nsfs_evict: (ffffffff81254fb0)
>             7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
>             7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
>             7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
>             7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
>             7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
>             7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
>             7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
>             7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
>             7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
>             7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
>             7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
>             7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
>             7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
>             7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
>             7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
>             7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
>             7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
>                    e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
>
> So that's the stack where I see it work, but I never see it go through
> the cleanup_mnt() path, and here's why.  First, the code to for loop
> in umount_tree():
>
>         while (!list_empty(&tmp_list)) {
>                 struct mnt_namespace *ns;
>                 bool disconnect;
>                 p = list_first_entry(&tmp_list, struct mount, mnt_list);
>                 list_del_init(&p->mnt_expire);
>                 list_del_init(&p->mnt_list);
>                 ns = p->mnt_ns;
>                 if (ns) {
>                         ns->mounts--;
>                         __touch_mnt_namespace(ns);
>                 }
>                 p->mnt_ns = NULL;
>                 if (how & UMOUNT_SYNC)
>                         p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
>                         
>   #1 --->       disconnect = disconnect_mount(p, how);
>
>   #2 --->       pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
>                                  disconnect ? &unmounted : NULL);
>                 if (mnt_has_parent(p)) {
>                         mnt_add_count(p->mnt_parent, -1);
>                         if (!disconnect) {
>                                 /* Don't forget about p */
>                                 list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
>                         } else {
>                                 umount_mnt(p);
>                         }       
>                 }
>   #3 --->       change_mnt_propagation(p, MS_PRIVATE);
>         }
>
>
> So at #1 disconnect is false if p has MNT_LOCKED set.
> At #2 p isn't added to the s_list on 'unmounted' if disconnect is false.
>
> The mount gets hidden from the host container at #3, but that's not
> germane to the invocation of pin_kill.
>
> This is namespace_unlock:
>
>         hlist_move_list(&unmounted, &head);
>
>         up_write(&namespace_sem);
>
>         if (likely(hlist_empty(&head)))
>                 return;
>
>         synchronize_rcu();
>
>         group_pin_kill(&head);
>
> So unmounted is moved to head, and group_pin_kill is invoked on that.
> Only the mounts we marked for disconnect go through the cleanup_mnt path
> that way.

At which point you have an island of mounts.

In that island each submount is on its parent's mnt_pins list.
When the last reference on a parent is dropped we call:
    umount_mnt on the children from mntput_no_expire
    drop_mountpoint from mnt_pin_kill from cleanup_mnt, indirectly from mntput_no_expire

So all we need is mntput_no_expire on a mount to be called for the
entire island to be freed.

So the fundamental issue appears to be that nothing is dropping the last
reference to some part of your island of mounts.

> So that's the fundamental question I'm trying to ask.  If we have a
> mount tree that's umount(MNT_DETACH)'d immediately after a pivot_root,
> but those mounts never get cleaned up except when their
> mountpoints get rm'd or mv'd, is there a better way to clean up this
> tree?

SIGKILL the process that is holding a reference.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
  2017-01-11  3:07         ` Krister Johansen
@ 2017-01-13  0:37             ` Andrei Vagin
  -1 siblings, 0 replies; 22+ messages in thread
From: Andrei Vagin @ 2017-01-13  0:37 UTC (permalink / raw)
  To: Krister Johansen
  Cc: linux-fsdevel, Linux Containers, Eric W. Biederman, Al Viro

On Tue, Jan 10, 2017 at 7:07 PM, Krister Johansen
<kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org> wrote:
> On Wed, Jan 11, 2017 at 03:04:22PM +1300, Eric W. Biederman wrote:
>> Any chance you have a trivial reproducer script?
>>
>> From your description I don't quite see the problem.  I know where to
>> look, but if you could give a script that reproduces the conditions you
>> see, that would make it easier for me to dig into and would certainly
>> remove ambiguity.  Ideally such a script would be runnable
>> under unshare -Urm for easy repeated testing.
>
> My apologies.  I don't have something that fits into a shell script, but
> I can walk you through the simplest test case that I used when I was
> debugging this.
>
> Create a net ns:
>
>     $ sudo unshare -n bash
>     # echo $$
>     2771
>
> In another terminal bind mount that ns onto a file:
>
>     # mkdir /run/testns
>     # touch /run/testns/ns1
>     # mount --bind /proc/2771/ns/net /run/testns/ns1
>
> Back in first terminal, create a new ns, pivot root, and umount detach:
>
>     # exit
>     $ unshare -U -m -n --propagation slave --map-root-user bash
>     # mkdir binddir
>     # mount --bind binddir binddir
>     # cp busybox binddir
>     # mkdir binddir/old_root
>     # cd binddir
>     # pivot_root . old_root
>     # ./busybox umount -l old_root

Hi,

But this process still has mappings from "old_root"
[root@fc24 busybox]# cat /proc/$$/maps
5607360f1000-5607361e9000 r-xp 00000000 fd:02 1176793
  /usr/bin/bash
5607363e8000-5607363ec000 r--p 000f7000 fd:02 1176793
  /usr/bin/bash
5607363ec000-5607363f5000 rw-p 000fb000 fd:02 1176793
  /usr/bin/bash
...

You have to call "exec ./busybox sh" to release all "old_root" mounts.
And in this case I see that a net namespace is destroyed:

[root@fc24 busybox]# cat /proc/slabinfo | /bin/grep net_name
net_namespace          5      8   6784    4    8 : tunables    0    0
  0 : slabdata      2      2      0
[root@fc24 busybox]# exec /bin/sh
/ # cat /proc/slabinfo | /bin/grep -- net
net_namespace          4      8   6784    4    8 : tunables    0    0
  0 : slabdata      2      2      0
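
To see why the mapping alone is enough, here is a minimal userspace sketch
(the path is only an example; any file that lives under the detached tree
keeps it pinned):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch: an mmap of a file on a lazily-detached mount keeps a reference
 * to that mount (via the mapping's struct file) even after the fd is
 * closed, so the whole detached island stays referenced until munmap,
 * exec or exit.
 */
int main(int argc, char **argv)
{
	/* example path, adjust for your setup */
	const char *path = argc > 1 ? argv[1] : "/old_root/busybox";
	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	void *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	close(fd);	/* the fd is gone, but the mapping still pins the file */
	puts("mapping held; detached mounts stay referenced until exec/exit");
	pause();	/* watch nsfs_evict / slabinfo from another terminal */
	return 0;
}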

>
> Back in second terminal:
>
>     # umount /run/testns/ns1
> [ watch for ns cleanup -- not seen if mnt is locked ]
>     # rm /run/testns/ns1
> [ now we see it ]
>
>
> For the observability stuff, I went back and forth between using 'perf
> probe' to place a kprobe on nsfs_evict, and using a bcc script to
> watch events on the same kprobe.  I can send along the script, if you're
> a bcc user.
>
> At least when I debugged this, I found that when the mount was
> MNT_LOCKED, disconnect_mount() returned false so the actual unmount
> didn't happen until the mountpoint was rm'd in the host container.
>
> I'm not sure if this is actually a bug, or a case where the cleanup is
> just conservative.  However, it looked like in the case where we call
> pivot_root, the detached mounts get marked private but otherwise aren't
> in use in the container's namespace any longer.
>
> -K
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
       [not found]             ` <CANaxB-zMzS-euqR1_LvZSoEsO-Y6q=_qGNTJZCKZTL5WfFF16g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-01-13 23:28               ` Krister Johansen
  0 siblings, 0 replies; 22+ messages in thread
From: Krister Johansen @ 2017-01-13 23:28 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: linux-fsdevel, Linux Containers, Eric W. Biederman, Al Viro

On Thu, Jan 12, 2017 at 04:37:13PM -0800, Andrei Vagin wrote:
> On Tue, Jan 10, 2017 at 7:07 PM, Krister Johansen
> <kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org> wrote:
> > On Wed, Jan 11, 2017 at 03:04:22PM +1300, Eric W. Biederman wrote:
> >> Any chance you have a trivial reproducer script?
> >>
> >> From your description I don't quite see the problem.  I know where to
> >> look, but if you could give a script that reproduces the conditions you
> >> see, that would make it easier for me to dig into and would certainly
> >> remove ambiguity.  Ideally such a script would be runnable
> >> under unshare -Urm for easy repeated testing.
> >
> > My apologies.  I don't have something that fits into a shell script, but
> > I can walk you through the simplest test case that I used when I was
> > debugging this.
> >
> > Create a net ns:
> >
> >     $ sudo unshare -n bash
> >     # echo $$
> >     2771
> >
> > In another terminal bind mount that ns onto a file:
> >
> >     # mkdir /run/testns
> >     # touch /run/testns/ns1
> >     # mount --bind /proc/2771/ns/net /run/testns/ns1
> >
> > Back in first terminal, create a new ns, pivot root, and umount detach:
> >
> >     # exit
> >     $ unshare -U -m -n --propagation slave --map-root-user bash
> >     # mkdir binddir
> >     # mount --bind binddir binddir
> >     # cp busybox binddir
> >     # mkdir binddir/old_root
> >     # cd binddir
> >     # pivot_root . old_root
> >     # ./busybox umount -l old_root
> 
> Hi,
> 
> But this process still has mappings from "old_root"
> [root@fc24 busybox]# cat /proc/$$/maps
> 5607360f1000-5607361e9000 r-xp 00000000 fd:02 1176793
>   /usr/bin/bash
> 5607363e8000-5607363ec000 r--p 000f7000 fd:02 1176793
>   /usr/bin/bash
> 5607363ec000-5607363f5000 rw-p 000fb000 fd:02 1176793
>   /usr/bin/bash
> ...
> 
> You have to call "exec ./busybox sh" to release all "old_root" mounts.
> And in this case I see that a net namespace is destroyed:
> 
> [root@fc24 busybox]# cat /proc/slabinfo | /bin/grep net_name
> net_namespace          5      8   6784    4    8 : tunables    0    0
>   0 : slabdata      2      2      0
> [root@fc24 busybox]# exec /bin/sh
> / # cat /proc/slabinfo | /bin/grep -- net
> net_namespace          4      8   6784    4    8 : tunables    0    0
>   0 : slabdata      2      2      0

Thanks.  This seems to be the part of the puzzle that I was missing.  I
went back and looked and found that the container pid 1 did have live
memory mappings to files that are mounted on the old_root.  Appreciate
the nudge in the right direction.

-K

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible bug: detached mounts difficult to cleanup
  2017-01-12  8:26               ` Eric W. Biederman
@ 2017-01-13 23:28                   ` Krister Johansen
  -1 siblings, 0 replies; 22+ messages in thread
From: Krister Johansen @ 2017-01-13 23:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Al Viro

On Thu, Jan 12, 2017 at 09:26:20PM +1300, Eric W. Biederman wrote:
> At which point you have an island of mounts.
> 
> In that island each submount is on its parent's mnt_pins list.
> When the last reference to a parent is dropped we call
>     umount_mnt on the children from mntput_no_expire, and
>     drop_mountpoint from mnt_pin_kill, via cleanup_mnt, indirectly from mntput_no_expire
> 
> So all we need is for mntput_no_expire to be called on a mount in the
> island for the entire island to be freed.
> 
> So the fundamental issue appears to be that nothing is dropping the last
> reference to some part of your island of mounts.

Ok, got it.  With the pointer from Andrei I was able to go back and
verify that the container's init process had live memory mappings from
files that were on the old root.  I apologize for the false alarm, and I
do appreciate you taking the time to explain the particulars.

-K

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-01-13 23:28 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-11  1:24 Possible bug: detached mounts difficult to cleanup Krister Johansen
2017-01-11  2:27 ` Eric W. Biederman
2017-01-11  2:51   ` Al Viro
     [not found]   ` <87fukqwcue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-11  2:37     ` Eric W. Biederman
2017-01-11  2:37       ` Eric W. Biederman
     [not found]       ` <87shoqtj7z.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-12  6:15         ` Krister Johansen
2017-01-12  6:15           ` Krister Johansen
     [not found]           ` <20170112061539.GA2345-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
2017-01-12  8:26             ` Eric W. Biederman
2017-01-12  8:26               ` Eric W. Biederman
     [not found]               ` <87r348y98z.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-13 23:28                 ` Krister Johansen
2017-01-13 23:28                   ` Krister Johansen
2017-01-11  2:51     ` Al Viro
     [not found] ` <20170111012454.GB2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
2017-01-11  2:04   ` Eric W. Biederman
2017-01-11  2:04     ` Eric W. Biederman
     [not found]     ` <87r34a5p3t.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-11  3:07       ` Krister Johansen
2017-01-11  3:07         ` Krister Johansen
     [not found]         ` <20170111030753.GC2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
2017-01-13  0:37           ` Andrei Vagin
2017-01-13  0:37             ` Andrei Vagin
     [not found]             ` <CANaxB-zMzS-euqR1_LvZSoEsO-Y6q=_qGNTJZCKZTL5WfFF16g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-13 23:28               ` Krister Johansen
2017-01-13 23:28             ` Krister Johansen
2017-01-11  2:27   ` Eric W. Biederman
  -- strict thread matches above, loose matches on Subject: below --
2017-01-11  1:24 Krister Johansen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.