All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] vfs: Fix RCU usage in __propagate_umount()
@ 2014-07-30 13:59 Richard Weinberger
  2014-07-30 14:20 ` Richard Weinberger
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Richard Weinberger @ 2014-07-30 13:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, hch, paulmck, jeffm, linux-kernel, Richard Weinberger

If we use the plain list_empty() we might not see the
hlist_del_init_rcu() and therefore miss one member of the
list.

It fixes the following issue:
$ unshare -m /usr/bin/sleep 10000 &
$ mkdir -p foo/proc
$ mount -t proc none foo/proc
$ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
$ umount -l foo/proc
$ rmdir foo/proc
rmdir: failed to remove ‘foo/proc’: Device or resource busy

rmdir fails because the last entry in the RCU list, "proc", was
not propagated as list_empty() still returned false instead of true.

Signed-off-by: Richard Weinberger <richard@sigma-star.at>
---
Hi!

Please review this patch with care, the comments in rculist.h
confused me like hell:

First it says:
/*
 * Why is there no list_empty_rcu()?  Because list_empty() serves this
 * purpose.  The list_empty() function fetches the RCU-protected pointer
 * and compares it to the address of the list head, but neither dereferences
 * this pointer itself nor provides this pointer to the caller.  Therefore,
 * it is not necessary to use rcu_dereference(), so that list_empty() can
 * be used anywhere you would want to use a list_empty_rcu().
 */

And later:
/**
 * Where are list_empty_rcu() and list_first_entry_rcu()?
 *
 * Implementing those functions following their counterparts list_empty() and
 * list_first_entry() is not advisable because they lead to subtle race
 * conditions as the following snippet shows:
 *
 * if (!list_empty_rcu(mylist)) {
 *      struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
 *      do_something(bar);
 * }
 *
 * The list may not be empty when list_empty_rcu checks it, but it may be when
 * list_first_entry_rcu rereads the ->next pointer.
 *
 * Rereading the ->next pointer is not a problem for list_empty() and
 * list_first_entry() because they would be protected by a lock that blocks
 * writers.
 *
 * See list_first_or_null_rcu for an alternative.
 */

To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
or am I missing something?

Thanks,
//richard

 fs/pnode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 302bf22..883901c 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
 		 * umount the child only if the child has no
 		 * other children
 		 */
-		if (child && list_empty(&child->mnt_mounts)) {
+		if (child && list_first_or_null_rcu(&child->mnt_mounts,
+						    struct mount, mnt_mounts)) {
 			hlist_del_init_rcu(&child->mnt_hash);
 			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
 		}
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()
  2014-07-30 13:59 [PATCH] vfs: Fix RCU usage in __propagate_umount() Richard Weinberger
@ 2014-07-30 14:20 ` Richard Weinberger
  2014-07-30 18:39   ` Paul E. McKenney
  2014-07-30 20:46 ` MNT_DETACH and mount namespace issue (was: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()) Richard Weinberger
  2 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2014-07-30 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, hch, paulmck, jeffm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3074 bytes --]

Am 30.07.2014 15:59, schrieb Richard Weinberger:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
> 
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy
> 
> rmdir fails because the last entry in the RCU list, "proc", was
> not propagated as list_empty() still returned false instead of true.
> 
> Signed-off-by: Richard Weinberger <richard@sigma-star.at>

Please drop this patch, it is wrong. :-\

Thanks,
//richard

> ---
> Hi!
> 
> Please review this patch with care, the comments in rculist.h
> confused me like hell:
> 
> First it says:
> /*
>  * Why is there no list_empty_rcu()?  Because list_empty() serves this
>  * purpose.  The list_empty() function fetches the RCU-protected pointer
>  * and compares it to the address of the list head, but neither dereferences
>  * this pointer itself nor provides this pointer to the caller.  Therefore,
>  * it is not necessary to use rcu_dereference(), so that list_empty() can
>  * be used anywhere you would want to use a list_empty_rcu().
>  */
> 
> And later:
> /**
>  * Where are list_empty_rcu() and list_first_entry_rcu()?
>  *
>  * Implementing those functions following their counterparts list_empty() and
>  * list_first_entry() is not advisable because they lead to subtle race
>  * conditions as the following snippet shows:
>  *
>  * if (!list_empty_rcu(mylist)) {
>  *      struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
>  *      do_something(bar);
>  * }
>  *
>  * The list may not be empty when list_empty_rcu checks it, but it may be when
>  * list_first_entry_rcu rereads the ->next pointer.
>  *
>  * Rereading the ->next pointer is not a problem for list_empty() and
>  * list_first_entry() because they would be protected by a lock that blocks
>  * writers.
>  *
>  * See list_first_or_null_rcu for an alternative.
>  */
> 
> To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
> or am I missing something?
> 
> Thanks,
> //richard
> 
>  fs/pnode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 302bf22..883901c 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
>  		 * umount the child only if the child has no
>  		 * other children
>  		 */
> -		if (child && list_empty(&child->mnt_mounts)) {
> +		if (child && list_first_or_null_rcu(&child->mnt_mounts,
> +						    struct mount, mnt_mounts)) {
>  			hlist_del_init_rcu(&child->mnt_hash);
>  			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>  		}
> 

-- 
sigma star gmbh - Bundesstrasse 3 - 6111 Volders - Austria
ATU66964118 - FN 374287y


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 901 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()
  2014-07-30 13:59 [PATCH] vfs: Fix RCU usage in __propagate_umount() Richard Weinberger
@ 2014-07-30 18:39   ` Paul E. McKenney
  2014-07-30 18:39   ` Paul E. McKenney
  2014-07-30 20:46 ` MNT_DETACH and mount namespace issue (was: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()) Richard Weinberger
  2 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2014-07-30 18:39 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-fsdevel, viro, hch, jeffm, linux-kernel

On Wed, Jul 30, 2014 at 03:59:16PM +0200, Richard Weinberger wrote:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
> 
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy
> 
> rmdir fails because the last entry in the RCU list, "proc", was
> not propagated as list_empty() still returned false instead of true.
> 
> Signed-off-by: Richard Weinberger <richard@sigma-star.at>
> ---
> Hi!
> 
> Please review this patch with care, the comments in rculist.h
> confused me like hell:

So there are some special cases where list_empty() can be applied to
an RCU-protected list.  One such case is where only one task is
permitted to remove from the list (or, equivalently, where removal
is protected by a lock).  Then if that task sees !list_empty(), it
knows that the list will remain non-empty because no one else is
removing things.

There are other special cases, but this one gives the general flavor.

							Thanx, Paul

> First it says:
> /*
>  * Why is there no list_empty_rcu()?  Because list_empty() serves this
>  * purpose.  The list_empty() function fetches the RCU-protected pointer
>  * and compares it to the address of the list head, but neither dereferences
>  * this pointer itself nor provides this pointer to the caller.  Therefore,
>  * it is not necessary to use rcu_dereference(), so that list_empty() can
>  * be used anywhere you would want to use a list_empty_rcu().
>  */
> 
> And later:
> /**
>  * Where are list_empty_rcu() and list_first_entry_rcu()?
>  *
>  * Implementing those functions following their counterparts list_empty() and
>  * list_first_entry() is not advisable because they lead to subtle race
>  * conditions as the following snippet shows:
>  *
>  * if (!list_empty_rcu(mylist)) {
>  *      struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
>  *      do_something(bar);
>  * }
>  *
>  * The list may not be empty when list_empty_rcu checks it, but it may be when
>  * list_first_entry_rcu rereads the ->next pointer.
>  *
>  * Rereading the ->next pointer is not a problem for list_empty() and
>  * list_first_entry() because they would be protected by a lock that blocks
>  * writers.
>  *
>  * See list_first_or_null_rcu for an alternative.
>  */
> 
> To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
> or am I missing something?
> 
> Thanks,
> //richard
> 
>  fs/pnode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 302bf22..883901c 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
>  		 * umount the child only if the child has no
>  		 * other children
>  		 */
> -		if (child && list_empty(&child->mnt_mounts)) {
> +		if (child && list_first_or_null_rcu(&child->mnt_mounts,
> +						    struct mount, mnt_mounts)) {
>  			hlist_del_init_rcu(&child->mnt_hash);
>  			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>  		}
> -- 
> 1.8.4.5
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()
@ 2014-07-30 18:39   ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2014-07-30 18:39 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-fsdevel, viro, hch, jeffm, linux-kernel

On Wed, Jul 30, 2014 at 03:59:16PM +0200, Richard Weinberger wrote:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
> 
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy
> 
> rmdir fails because the last entry in the RCU list, "proc", was
> not propagated as list_empty() still returned false instead of true.
> 
> Signed-off-by: Richard Weinberger <richard@sigma-star.at>
> ---
> Hi!
> 
> Please review this patch with care, the comments in rculist.h
> confused me like hell:

So there are some special cases where list_empty() can be applied to
an RCU-protected list.  One such case is where only one task is
permitted to remove from the list (or, equivalently, where removal
is protected by a lock).  Then if that task sees !list_empty(), it
knows that the list will remain non-empty because no one else is
removing things.

There are other special cases, but this one gives the general flavor.

							Thanx, Paul

> First it says:
> /*
>  * Why is there no list_empty_rcu()?  Because list_empty() serves this
>  * purpose.  The list_empty() function fetches the RCU-protected pointer
>  * and compares it to the address of the list head, but neither dereferences
>  * this pointer itself nor provides this pointer to the caller.  Therefore,
>  * it is not necessary to use rcu_dereference(), so that list_empty() can
>  * be used anywhere you would want to use a list_empty_rcu().
>  */
> 
> And later:
> /**
>  * Where are list_empty_rcu() and list_first_entry_rcu()?
>  *
>  * Implementing those functions following their counterparts list_empty() and
>  * list_first_entry() is not advisable because they lead to subtle race
>  * conditions as the following snippet shows:
>  *
>  * if (!list_empty_rcu(mylist)) {
>  *      struct foo *bar = list_first_entry_rcu(mylist, struct foo, list_member);
>  *      do_something(bar);
>  * }
>  *
>  * The list may not be empty when list_empty_rcu checks it, but it may be when
>  * list_first_entry_rcu rereads the ->next pointer.
>  *
>  * Rereading the ->next pointer is not a problem for list_empty() and
>  * list_first_entry() because they would be protected by a lock that blocks
>  * writers.
>  *
>  * See list_first_or_null_rcu for an alternative.
>  */
> 
> To my understanding we cannot use list_empty() and have to use list_first_or_null_rcu(),
> or am I missing something?
> 
> Thanks,
> //richard
> 
>  fs/pnode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 302bf22..883901c 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -380,7 +380,8 @@ static void __propagate_umount(struct mount *mnt)
>  		 * umount the child only if the child has no
>  		 * other children
>  		 */
> -		if (child && list_empty(&child->mnt_mounts)) {
> +		if (child && list_first_or_null_rcu(&child->mnt_mounts,
> +						    struct mount, mnt_mounts)) {
>  			hlist_del_init_rcu(&child->mnt_hash);
>  			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>  		}
> -- 
> 1.8.4.5
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* MNT_DETACH and mount namespace issue (was: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount())
  2014-07-30 13:59 [PATCH] vfs: Fix RCU usage in __propagate_umount() Richard Weinberger
  2014-07-30 14:20 ` Richard Weinberger
  2014-07-30 18:39   ` Paul E. McKenney
@ 2014-07-30 20:46 ` Richard Weinberger
  2014-07-31 22:17   ` MNT_DETACH and mount namespace issue Richard Weinberger
  2 siblings, 1 reply; 16+ messages in thread
From: Richard Weinberger @ 2014-07-30 20:46 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, hch, paulmck, jeffm, sahne, linux-kernel

Am 30.07.2014 15:59, schrieb Richard Weinberger:
> If we use the plain list_empty() we might not see the
> hlist_del_init_rcu() and therefore miss one member of the
> list.
> 
> It fixes the following issue:
> $ unshare -m /usr/bin/sleep 10000 &
> $ mkdir -p foo/proc
> $ mount -t proc none foo/proc
> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> $ umount -l foo/proc
> $ rmdir foo/proc
> rmdir: failed to remove ‘foo/proc’: Device or resource busy

Although my fix was wrong, the issue is real, it seems to exist for a very long
time. Just was able to reproduce it on 2.6.32.
Please note that you need a shared root subtree to trigger the issue.
i.e. mount --shared /
Maybe this is why nobody noticed it so far as only systemd distros
have the root subtree shared by default.

I hit the issue on openSUSE 13.1 where an application creates a chroot environment
and then lazy umounts /proc.
It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
were affected. This did not make any sense until I discovered that the OpenVPN systemd
service file has set "PrivateTmp=true". This setting creates
a mount namespace for the said service...

In __propagate_umount() the following piece of code is interesting:

 /*
 * umount the child only if the child has no
 * other children
 */
if (child && list_empty(&child->mnt_mounts)) {
        hlist_del_init_rcu(&child->mnt_hash);
        hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
}

child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
subtree was removed.
I'm not sure whether this is only one more symptom or the main culprit.

Any ideas?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-07-30 20:46 ` MNT_DETACH and mount namespace issue (was: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()) Richard Weinberger
@ 2014-07-31 22:17   ` Richard Weinberger
  2014-08-01 15:44     ` Ram Pai
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Weinberger @ 2014-07-31 22:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, hch, paulmck, jeffm, sahne, linux-kernel, linuxram

Am 30.07.2014 22:46, schrieb Richard Weinberger:
> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>> If we use the plain list_empty() we might not see the
>> hlist_del_init_rcu() and therefore miss one member of the
>> list.
>>
>> It fixes the following issue:
>> $ unshare -m /usr/bin/sleep 10000 &
>> $ mkdir -p foo/proc
>> $ mount -t proc none foo/proc
>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>> $ umount -l foo/proc
>> $ rmdir foo/proc
>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
> 
> Although my fix was wrong, the issue is real, it seems to exist for a very long
> time. Just was able to reproduce it on 2.6.32.
> Please note that you need a shared root subtree to trigger the issue.
> i.e. mount --shared /
> Maybe this is why nobody noticed it so far as only systemd distros
> have the root subtree shared by default.
> 
> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
> and then lazy umounts /proc.
> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
> were affected. This did not make any sense until I discovered that the OpenVPN systemd
> service file has set "PrivateTmp=true". This setting creates
> a mount namespace for the said service...
> 
> In __propagate_umount() the following piece of code is interesting:
> 
>  /*
>  * umount the child only if the child has no
>  * other children
>  */
> if (child && list_empty(&child->mnt_mounts)) {
>         hlist_del_init_rcu(&child->mnt_hash);
>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
> }
> 
> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
> subtree was removed.
> I'm not sure whether this is only one more symptom or the main culprit.

CC'ing Ram Pai.

Ram, you are the author of the said code. Can you please explain why we need that
list_empty() check?
To my (limited) understanding of VFS, the following change should be fine to fix the issue:

diff --git a/fs/pnode.c b/fs/pnode.c
index 302bf22..627b35f 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -376,11 +376,7 @@ static void __propagate_umount(struct mount *mnt)

 		struct mount *child = __lookup_mnt_last(&m->mnt,
 						mnt->mnt_mountpoint);
-		/*
-		 * umount the child only if the child has no
-		 * other children
-		 */
-		if (child && list_empty(&child->mnt_mounts)) {
+		if (child) {
 			hlist_del_init_rcu(&child->mnt_hash);
 			hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
 		}

Thanks,
//richard

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-07-31 22:17   ` MNT_DETACH and mount namespace issue Richard Weinberger
@ 2014-08-01 15:44     ` Ram Pai
  2014-08-01 19:20         ` Richard Weinberger
  0 siblings, 1 reply; 16+ messages in thread
From: Ram Pai @ 2014-08-01 15:44 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: linux-fsdevel, viro, hch, paulmck, jeffm, sahne, linux-kernel

On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
> Am 30.07.2014 22:46, schrieb Richard Weinberger:
> > Am 30.07.2014 15:59, schrieb Richard Weinberger:
> >> If we use the plain list_empty() we might not see the
> >> hlist_del_init_rcu() and therefore miss one member of the
> >> list.
> >>
> >> It fixes the following issue:
> >> $ unshare -m /usr/bin/sleep 10000 &
> >> $ mkdir -p foo/proc
> >> $ mount -t proc none foo/proc
> >> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
> >> $ umount -l foo/proc
> >> $ rmdir foo/proc
> >> rmdir: failed to remove ‘foo/proc’: Device or resource busy
> > 
> > Although my fix was wrong, the issue is real, it seems to exist for a very long
> > time. Just was able to reproduce it on 2.6.32.
> > Please note that you need a shared root subtree to trigger the issue.
> > i.e. mount --shared /
> > Maybe this is why nobody noticed it so far as only systemd distros
> > have the root subtree shared by default.
> > 
> > I hit the issue on openSUSE 13.1 where an application creates a chroot environment
> > and then lazy umounts /proc.
> > It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
> > were affected. This did not make any sense until I discovered that the OpenVPN systemd
> > service file has set "PrivateTmp=true". This setting creates
> > a mount namespace for the said service...
> > 
> > In __propagate_umount() the following piece of code is interesting:
> > 
> >  /*
> >  * umount the child only if the child has no
> >  * other children
> >  */
> > if (child && list_empty(&child->mnt_mounts)) {
> >         hlist_del_init_rcu(&child->mnt_hash);
> >         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
> > }
> > 
> > child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
> > subtree was removed.
> > I'm not sure whether this is only one more symptom or the main culprit.
> 
> CC'ing Ram Pai.
> 
> Ram, you are the author of the said code. Can you please explain why we need that
> list_empty() check?
> To my (limited) understanding of VFS, the following change should be fine to fix the issue:

We had made a rule then, that busy vfsmounts cannot be lazily unmounted
**implicitly**. Propagated unmounts are implicit unmounts, and if such
implicit vfsmounts have child-mounts than obviously they are busy, and
hence they cannot be lazy-unmounted implicitly.

the list_empty() is checking for no child-mounts on the vfsmount before
letting it unmount.

We did not want a bunch of mounts disappear without the users knowledge.
Hence we made the above rule.

Al Viro, will have more insights into this.

RP


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-01 15:44     ` Ram Pai
@ 2014-08-01 19:20         ` Richard Weinberger
  0 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2014-08-01 19:20 UTC (permalink / raw)
  To: Ram Pai; +Cc: linux-fsdevel, viro, hch, paulmck, jeffm, sahne, linux-kernel

Am 01.08.2014 17:44, schrieb Ram Pai:
> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>> If we use the plain list_empty() we might not see the
>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>> list.
>>>>
>>>> It fixes the following issue:
>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>> $ mkdir -p foo/proc
>>>> $ mount -t proc none foo/proc
>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>> $ umount -l foo/proc
>>>> $ rmdir foo/proc
>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>
>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>> time. Just was able to reproduce it on 2.6.32.
>>> Please note that you need a shared root subtree to trigger the issue.
>>> i.e. mount --shared /
>>> Maybe this is why nobody noticed it so far as only systemd distros
>>> have the root subtree shared by default.
>>>
>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>> and then lazy umounts /proc.
>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>> service file has set "PrivateTmp=true". This setting creates
>>> a mount namespace for the said service...
>>>
>>> In __propagate_umount() the following piece of code is interesting:
>>>
>>>  /*
>>>  * umount the child only if the child has no
>>>  * other children
>>>  */
>>> if (child && list_empty(&child->mnt_mounts)) {
>>>         hlist_del_init_rcu(&child->mnt_hash);
>>>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>> }
>>>
>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>> subtree was removed.
>>> I'm not sure whether this is only one more symptom or the main culprit.
>>
>> CC'ing Ram Pai.
>>
>> Ram, you are the author of the said code. Can you please explain why we need that
>> list_empty() check?
>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
> 
> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
> **implicitly**. Propagated unmounts are implicit unmounts, and if such
> implicit vfsmounts have child-mounts than obviously they are busy, and
> hence they cannot be lazy-unmounted implicitly.
> 
> the list_empty() is checking for no child-mounts on the vfsmount before
> letting it unmount.
> 
> We did not want a bunch of mounts disappear without the users knowledge.
> Hence we made the above rule.
> 
> Al Viro, will have more insights into this.

Hmm, with the root subtree shared by default this policy will be problematic and
lead to problems.
As I observe on openSUSE 13.1.

Al, what do you think?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
@ 2014-08-01 19:20         ` Richard Weinberger
  0 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2014-08-01 19:20 UTC (permalink / raw)
  To: Ram Pai; +Cc: linux-fsdevel, viro, hch, paulmck, jeffm, sahne, linux-kernel

Am 01.08.2014 17:44, schrieb Ram Pai:
> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>> If we use the plain list_empty() we might not see the
>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>> list.
>>>>
>>>> It fixes the following issue:
>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>> $ mkdir -p foo/proc
>>>> $ mount -t proc none foo/proc
>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>> $ umount -l foo/proc
>>>> $ rmdir foo/proc
>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>
>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>> time. Just was able to reproduce it on 2.6.32.
>>> Please note that you need a shared root subtree to trigger the issue.
>>> i.e. mount --shared /
>>> Maybe this is why nobody noticed it so far as only systemd distros
>>> have the root subtree shared by default.
>>>
>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>> and then lazy umounts /proc.
>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>> service file has set "PrivateTmp=true". This setting creates
>>> a mount namespace for the said service...
>>>
>>> In __propagate_umount() the following piece of code is interesting:
>>>
>>>  /*
>>>  * umount the child only if the child has no
>>>  * other children
>>>  */
>>> if (child && list_empty(&child->mnt_mounts)) {
>>>         hlist_del_init_rcu(&child->mnt_hash);
>>>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>> }
>>>
>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>> subtree was removed.
>>> I'm not sure whether this is only one more symptom or the main culprit.
>>
>> CC'ing Ram Pai.
>>
>> Ram, you are the author of the said code. Can you please explain why we need that
>> list_empty() check?
>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
> 
> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
> **implicitly**. Propagated unmounts are implicit unmounts, and if such
> implicit vfsmounts have child-mounts than obviously they are busy, and
> hence they cannot be lazy-unmounted implicitly.
> 
> the list_empty() is checking for no child-mounts on the vfsmount before
> letting it unmount.
> 
> We did not want a bunch of mounts disappear without the users knowledge.
> Hence we made the above rule.
> 
> Al Viro, will have more insights into this.

Hmm, with the root subtree shared by default this policy will be problematic and
lead to problems.
As I observe on openSUSE 13.1.

Al, what do you think?

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-01 19:20         ` Richard Weinberger
  (?)
@ 2014-08-01 22:09         ` Eric W. Biederman
  2014-08-04  8:40           ` Richard Weinberger
  -1 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2014-08-01 22:09 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Ram Pai, linux-fsdevel, viro, hch, paulmck, jeffm, sahne, linux-kernel

Richard Weinberger <richard@nod.at> writes:

> Am 01.08.2014 17:44, schrieb Ram Pai:
>> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>>> If we use the plain list_empty() we might not see the
>>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>>> list.
>>>>>
>>>>> It fixes the following issue:
>>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>>> $ mkdir -p foo/proc
>>>>> $ mount -t proc none foo/proc
>>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>>> $ umount -l foo/proc
>>>>> $ rmdir foo/proc
>>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>>
>>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>>> time. Just was able to reproduce it on 2.6.32.
>>>> Please note that you need a shared root subtree to trigger the issue.
>>>> i.e. mount --shared /
>>>> Maybe this is why nobody noticed it so far as only systemd distros
>>>> have the root subtree shared by default.
>>>>
>>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>>> and then lazy umounts /proc.
>>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>>> service file has set "PrivateTmp=true". This setting creates
>>>> a mount namespace for the said service...
>>>>
>>>> In __propagate_umount() the following piece of code is interesting:
>>>>
>>>>  /*
>>>>  * umount the child only if the child has no
>>>>  * other children
>>>>  */
>>>> if (child && list_empty(&child->mnt_mounts)) {
>>>>         hlist_del_init_rcu(&child->mnt_hash);
>>>>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>>> }
>>>>
>>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>>> subtree was removed.
>>>> I'm not sure whether this is only one more symptom or the main culprit.
>>>
>>> CC'ing Ram Pai.
>>>
>>> Ram, you are the author of the said code. Can you please explain why we need that
>>> list_empty() check?
>>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
>> 
>> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
>> **implicitly**. Propagated unmounts are implicit unmounts, and if such
>> implicit vfsmounts have child-mounts than obviously they are busy, and
>> hence they cannot be lazy-unmounted implicitly.
>> 
>> the list_empty() is checking for no child-mounts on the vfsmount before
>> letting it unmount.
>> 
>> We did not want a bunch of mounts disappear without the users knowledge.
>> Hence we made the above rule.
>> 
>> Al Viro, will have more insights into this.
>
> Hmm, with the root subtree shared by default this policy will be problematic and
> lead to problems.
> As I observe on openSUSE 13.1.
>
> Al, what do you think?

I have a pending patchset that causes the rmdir to cause all of the
mounts to go away.  It has passed review and has not been merged only
because of stack overflow concerns (which I have not had time to fully
address).

Sigh.  It badly breaks unix semantics for rmdir unlink with no mounts in
the local namespace to fail, and it introduces as denial of service
attack from unprivielged users.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-01 22:09         ` Eric W. Biederman
@ 2014-08-04  8:40           ` Richard Weinberger
  2014-08-04 16:46             ` Eric W. Biederman
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Weinberger @ 2014-08-04  8:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Richard Weinberger, Ram Pai, linux-fsdevel, Al Viro,
	Christoph Hellwig, Paul McKenney, Jeff Mahoney, sahne,
	linux-kernel

On Sat, Aug 2, 2014 at 12:09 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Richard Weinberger <richard@nod.at> writes:
>
>> Am 01.08.2014 17:44, schrieb Ram Pai:
>>> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>>>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>>>> If we use the plain list_empty() we might not see the
>>>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>>>> list.
>>>>>>
>>>>>> It fixes the following issue:
>>>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>>>> $ mkdir -p foo/proc
>>>>>> $ mount -t proc none foo/proc
>>>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>>>> $ umount -l foo/proc
>>>>>> $ rmdir foo/proc
>>>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>>>
>>>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>>>> time. Just was able to reproduce it on 2.6.32.
>>>>> Please note that you need a shared root subtree to trigger the issue.
>>>>> i.e. mount --shared /
>>>>> Maybe this is why nobody noticed it so far as only systemd distros
>>>>> have the root subtree shared by default.
>>>>>
>>>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>>>> and then lazy umounts /proc.
>>>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>>>> service file has set "PrivateTmp=true". This setting creates
>>>>> a mount namespace for the said service...
>>>>>
>>>>> In __propagate_umount() the following piece of code is interesting:
>>>>>
>>>>>  /*
>>>>>  * umount the child only if the child has no
>>>>>  * other children
>>>>>  */
>>>>> if (child && list_empty(&child->mnt_mounts)) {
>>>>>         hlist_del_init_rcu(&child->mnt_hash);
>>>>>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>>>> }
>>>>>
>>>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>>>> subtree was removed.
>>>>> I'm not sure whether this is only one more symptom or the main culprit.
>>>>
>>>> CC'ing Ram Pai.
>>>>
>>>> Ram, you are the author of the said code. Can you please explain why we need that
>>>> list_empty() check?
>>>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
>>>
>>> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
>>> **implicitly**. Propagated unmounts are implicit unmounts, and if such
>>> implicit vfsmounts have child-mounts than obviously they are busy, and
>>> hence they cannot be lazy-unmounted implicitly.
>>>
>>> the list_empty() is checking for no child-mounts on the vfsmount before
>>> letting it unmount.
>>>
>>> We did not want a bunch of mounts disappear without the users knowledge.
>>> Hence we made the above rule.
>>>
>>> Al Viro, will have more insights into this.
>>
>> Hmm, with the root subtree shared by default this policy will be problematic and
>> lead to problems.
>> As I observe on openSUSE 13.1.
>>
>> Al, what do you think?
>
> I have a pending patchset that causes the rmdir to cause all of the
> mounts to go away.  It has passed review and has not been merged only
> because of stack overflow concerns (which I have not had time to fully
> address).
>
> Sigh.  It badly breaks unix semantics for rmdir unlink with no mounts in
> the local namespace to fail, and it introduces as denial of service
> attack from unprivielged users.

Thanks for the pointer!
I fear your patch series is nothing we can easily feed into -stable.
Is this really the only acceptable solution?

The thing is, users get already bitten by that.
i.e. run openSUSE's KIWI tool on a machine where a systemd service
with PrivateTmp=yes is installed
and you'll end up with a stale mount point.
KIWI creates a chroot, populates /proc, lazily unmounts it later and
then fails to remove the temporary chroot directory
because of EBUSY.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-04  8:40           ` Richard Weinberger
@ 2014-08-04 16:46             ` Eric W. Biederman
  2014-08-04 21:19               ` Richard Weinberger
  0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2014-08-04 16:46 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Richard Weinberger, Ram Pai, linux-fsdevel, Al Viro,
	Christoph Hellwig, Paul McKenney, Jeff Mahoney, sahne,
	linux-kernel

Richard Weinberger <richard.weinberger@gmail.com> writes:

> On Sat, Aug 2, 2014 at 12:09 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Richard Weinberger <richard@nod.at> writes:
>>
>>> Am 01.08.2014 17:44, schrieb Ram Pai:
>>>> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>>>>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>>>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>>>>> If we use the plain list_empty() we might not see the
>>>>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>>>>> list.
>>>>>>>
>>>>>>> It fixes the following issue:
>>>>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>>>>> $ mkdir -p foo/proc
>>>>>>> $ mount -t proc none foo/proc
>>>>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>>>>> $ umount -l foo/proc
>>>>>>> $ rmdir foo/proc
>>>>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>>>>
>>>>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>>>>> time. Just was able to reproduce it on 2.6.32.
>>>>>> Please note that you need a shared root subtree to trigger the issue.
>>>>>> i.e. mount --shared /
>>>>>> Maybe this is why nobody noticed it so far as only systemd distros
>>>>>> have the root subtree shared by default.
>>>>>>
>>>>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>>>>> and then lazy umounts /proc.
>>>>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>>>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>>>>> service file has set "PrivateTmp=true". This setting creates
>>>>>> a mount namespace for the said service...
>>>>>>
>>>>>> In __propagate_umount() the following piece of code is interesting:
>>>>>>
>>>>>>  /*
>>>>>>  * umount the child only if the child has no
>>>>>>  * other children
>>>>>>  */
>>>>>> if (child && list_empty(&child->mnt_mounts)) {
>>>>>>         hlist_del_init_rcu(&child->mnt_hash);
>>>>>>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>>>>> }
>>>>>>
>>>>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>>>>> subtree was removed.
>>>>>> I'm not sure whether this is only one more symptom or the main culprit.
>>>>>
>>>>> CC'ing Ram Pai.
>>>>>
>>>>> Ram, you are the author of the said code. Can you please explain why we need that
>>>>> list_empty() check?
>>>>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
>>>>
>>>> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
>>>> **implicitly**. Propagated unmounts are implicit unmounts, and if such
>>>> implicit vfsmounts have child-mounts than obviously they are busy, and
>>>> hence they cannot be lazy-unmounted implicitly.
>>>>
>>>> the list_empty() is checking for no child-mounts on the vfsmount before
>>>> letting it unmount.
>>>>
>>>> We did not want a bunch of mounts disappear without the users knowledge.
>>>> Hence we made the above rule.
>>>>
>>>> Al Viro, will have more insights into this.
>>>
>>> Hmm, with the root subtree shared by default this policy will be problematic and
>>> lead to problems.
>>> As I observe on openSUSE 13.1.
>>>
>>> Al, what do you think?
>>
>> I have a pending patchset that causes the rmdir to cause all of the
>> mounts to go away.  It has passed review and has not been merged only
>> because of stack overflow concerns (which I have not had time to fully
>> address).
>>
>> Sigh.  It badly breaks unix semantics for rmdir unlink with no mounts in
>> the local namespace to fail, and it introduces as denial of service
>> attack from unprivielged users.
>
> Thanks for the pointer!
> I fear your patch series is nothing we can easily feed into -stable.
> Is this really the only acceptable solution?

I am not really certain.  What I know is one of these days I need to
take the time and get that merged.  It isn't going to be in 3.17
unfortunately :(

What I do know is if you are asking questions about sane semantics my
patch and the approach it takes feeds in.

It sounds like your problem is with lazy recursive unmounts not being
propogated because there is a chance that in the destination there might
be something mounted on top.  

> The thing is, users get already bitten by that.

The issue I am dealing with has been biting users for years.
as root rm -rf dir failing with EBUSY because of a stale process in
the mount namespace is pretty horrid.

> i.e. run openSUSE's KIWI tool on a machine where a systemd service
> with PrivateTmp=yes is installed
> and you'll end up with a stale mount point.
> KIWI creates a chroot, populates /proc, lazily unmounts it later and
> then fails to remove the temporary chroot directory
> because of EBUSY.

Hmm.  That problem does sound familiar.

Is the problem oversharing?  Is the problem that the mount of /proc in
the chroot directory is propogating into other mount namespaces that
don't care?

If the problem is over propogating I would argue that KIWI needs to
use a mount namespace instead of a chroot and to tweak the mount
namespace so the mount of /proc simply does not leak out.

Not that the kernel doesn't need to be fixed but that a design where
mounts propogate everywhere is a problem in userspace anyway, and it is
likely going to only need to be a handful of lines of code to fix
userspace cleanly.  While the kernel fix or fixes are going to require a
bit more time.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-04 16:46             ` Eric W. Biederman
@ 2014-08-04 21:19               ` Richard Weinberger
  2014-08-04 22:10                 ` Ram Pai
  2014-08-04 22:59                 ` Eric W. Biederman
  0 siblings, 2 replies; 16+ messages in thread
From: Richard Weinberger @ 2014-08-04 21:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ram Pai, linux-fsdevel, Al Viro, Christoph Hellwig,
	Paul McKenney, Jeff Mahoney, sahne, linux-kernel

Am 04.08.2014 18:46, schrieb Eric W. Biederman:
> Richard Weinberger <richard.weinberger@gmail.com> writes:
> 
>> On Sat, Aug 2, 2014 at 12:09 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Richard Weinberger <richard@nod.at> writes:
>>>
>>>> Am 01.08.2014 17:44, schrieb Ram Pai:
>>>>> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>>>>>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>>>>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>>>>>> If we use the plain list_empty() we might not see the
>>>>>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>>>>>> list.
>>>>>>>>
>>>>>>>> It fixes the following issue:
>>>>>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>>>>>> $ mkdir -p foo/proc
>>>>>>>> $ mount -t proc none foo/proc
>>>>>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>>>>>> $ umount -l foo/proc
>>>>>>>> $ rmdir foo/proc
>>>>>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>>>>>
>>>>>>> Although my fix was wrong, the issue is real, it seems to exist for a very long
>>>>>>> time. Just was able to reproduce it on 2.6.32.
>>>>>>> Please note that you need a shared root subtree to trigger the issue.
>>>>>>> i.e. mount --shared /
>>>>>>> Maybe this is why nobody noticed it so far as only systemd distros
>>>>>>> have the root subtree shared by default.
>>>>>>>
>>>>>>> I hit the issue on openSUSE 13.1 where an application creates a chroot environment
>>>>>>> and then lazy umounts /proc.
>>>>>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>>>>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>>>>>> service file has set "PrivateTmp=true". This setting creates
>>>>>>> a mount namespace for the said service...
>>>>>>>
>>>>>>> In __propagate_umount() the following piece of code is interesting:
>>>>>>>
>>>>>>>  /*
>>>>>>>  * umount the child only if the child has no
>>>>>>>  * other children
>>>>>>>  */
>>>>>>> if (child && list_empty(&child->mnt_mounts)) {
>>>>>>>         hlist_del_init_rcu(&child->mnt_hash);
>>>>>>>         hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>>>>>> }
>>>>>>>
>>>>>>> child->mnt_mounts is non-empty for the "proc" although the "binfmt_misc"
>>>>>>> subtree was removed.
>>>>>>> I'm not sure whether this is only one more symptom or the main culprit.
>>>>>>
>>>>>> CC'ing Ram Pai.
>>>>>>
>>>>>> Ram, you are the author of the said code. Can you please explain why we need that
>>>>>> list_empty() check?
>>>>>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
>>>>>
>>>>> We had made a rule then, that busy vfsmounts cannot be lazily unmounted
>>>>> **implicitly**. Propagated unmounts are implicit unmounts, and if such
>>>>> implicit vfsmounts have child-mounts than obviously they are busy, and
>>>>> hence they cannot be lazy-unmounted implicitly.
>>>>>
>>>>> the list_empty() is checking for no child-mounts on the vfsmount before
>>>>> letting it unmount.
>>>>>
>>>>> We did not want a bunch of mounts disappear without the users knowledge.
>>>>> Hence we made the above rule.
>>>>>
>>>>> Al Viro, will have more insights into this.
>>>>
>>>> Hmm, with the root subtree shared by default this policy will be problematic and
>>>> lead to problems.
>>>> As I observe on openSUSE 13.1.
>>>>
>>>> Al, what do you think?
>>>
>>> I have a pending patchset that causes the rmdir to cause all of the
>>> mounts to go away.  It has passed review and has not been merged only
>>> because of stack overflow concerns (which I have not had time to fully
>>> address).
>>>
>>> Sigh.  It badly breaks unix semantics for rmdir unlink with no mounts in
>>> the local namespace to fail, and it introduces as denial of service
>>> attack from unprivielged users.
>>
>> Thanks for the pointer!
>> I fear your patch series is nothing we can easily feed into -stable.
>> Is this really the only acceptable solution?
> 
> I am not really certain.  What I know is one of these days I need to
> take the time and get that merged.  It isn't going to be in 3.17
> unfortunately :(
> 
> What I do know is if you are asking questions about sane semantics my
> patch and the approach it takes feeds in.
> 
> It sounds like your problem is with lazy recursive unmounts not being
> propogated because there is a chance that in the destination there might
> be something mounted on top.  
> 
>> The thing is, users get already bitten by that.
> 
> The issue I am dealing with has been biting users for years.
> as root rm -rf dir failing with EBUSY because of a stale process in
> the mount namespace is pretty horrid.
> 
>> i.e. run openSUSE's KIWI tool on a machine where a systemd service
>> with PrivateTmp=yes is installed
>> and you'll end up with a stale mount point.
>> KIWI creates a chroot, populates /proc, lazily unmounts it later and
>> then fails to remove the temporary chroot directory
>> because of EBUSY.
> 
> Hmm.  That problem does sound familiar.
> 
> Is the problem oversharing?  Is the problem that the mount of /proc in
> the chroot directory is propogating into other mount namespaces that
> don't care?

/proc is propagating into another mount namespaces that does not care.
This happens because systemd creates for several services a mount namespace and sets
the root tree to MS_SHARED.

> If the problem is over propogating I would argue that KIWI needs to
> use a mount namespace instead of a chroot and to tweak the mount
> namespace so the mount of /proc simply does not leak out.
> 
> Not that the kernel doesn't need to be fixed but that a design where
> mounts propogate everywhere is a problem in userspace anyway, and it is
> likely going to only need to be a handful of lines of code to fix
> userspace cleanly.  While the kernel fix or fixes are going to require a
> bit more time.

KIWI can bypass the issue by not using a lazy unmount of /proc.
But I fear more applications will need fixing.
While I don't think that it was a wise choice from systemd developers to set
/ shared by default I agree that systemd is not the root cause of the problem.
It is the messenger.

It is just annoying that applications stopped working correctly after upgrading
to systemd.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-04 21:19               ` Richard Weinberger
@ 2014-08-04 22:10                 ` Ram Pai
  2014-08-04 22:16                   ` Richard Weinberger
  2014-08-04 22:59                 ` Eric W. Biederman
  1 sibling, 1 reply; 16+ messages in thread
From: Ram Pai @ 2014-08-04 22:10 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Eric W. Biederman, linux-fsdevel, Al Viro, Christoph Hellwig,
	Paul McKenney, Jeff Mahoney, sahne, linux-kernel

On Mon, Aug 04, 2014 at 11:19:35PM +0200, Richard Weinberger wrote:
> Am 04.08.2014 18:46, schrieb Eric W. Biederman:
> > Richard Weinberger <richard.weinberger@gmail.com> writes:
> 
> /proc is propagating into another mount namespaces that does not care.
> This happens because systemd creates for several services a mount namespace and sets
> the root tree to MS_SHARED.

if propagations are not needed, than set the root of the new mount
namespace to MS_PRIVATE first and then set it to MS_SHARED.

MS_PRIVATE will delink the propagations, and MS_SHARED later will enable
the new mounts to propagate to whoever wants them.

RP


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-04 22:10                 ` Ram Pai
@ 2014-08-04 22:16                   ` Richard Weinberger
  0 siblings, 0 replies; 16+ messages in thread
From: Richard Weinberger @ 2014-08-04 22:16 UTC (permalink / raw)
  To: Ram Pai
  Cc: Eric W. Biederman, linux-fsdevel, Al Viro, Christoph Hellwig,
	Paul McKenney, Jeff Mahoney, sahne, linux-kernel

Am 05.08.2014 00:10, schrieb Ram Pai:
> On Mon, Aug 04, 2014 at 11:19:35PM +0200, Richard Weinberger wrote:
>> Am 04.08.2014 18:46, schrieb Eric W. Biederman:
>>> Richard Weinberger <richard.weinberger@gmail.com> writes:
>>
>> /proc is propagating into another mount namespaces that does not care.
>> This happens because systemd creates for several services a mount namespace and sets
>> the root tree to MS_SHARED.
> 
> if propagations are not needed, than set the root of the new mount
> namespace to MS_PRIVATE first and then set it to MS_SHARED.
> 
> MS_PRIVATE will delink the propagations, and MS_SHARED later will enable
> the new mounts to propagate to whoever wants them.

AFAICT this would break systemd's PrivateTmp feature. :(
They want propagation. Such that a systemd service has a private /tmp but sees freshly
mounted filesystems after setting up the namespace.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: MNT_DETACH and mount namespace issue
  2014-08-04 21:19               ` Richard Weinberger
  2014-08-04 22:10                 ` Ram Pai
@ 2014-08-04 22:59                 ` Eric W. Biederman
  1 sibling, 0 replies; 16+ messages in thread
From: Eric W. Biederman @ 2014-08-04 22:59 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Ram Pai, linux-fsdevel, Al Viro, Christoph Hellwig,
	Paul McKenney, Jeff Mahoney, sahne, linux-kernel

Richard Weinberger <richard@nod.at> writes:

> Am 04.08.2014 18:46, schrieb Eric W. Biederman:
>> Richard Weinberger <richard.weinberger@gmail.com> writes:
>> 
>>> On Sat, Aug 2, 2014 at 12:09 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
[snip]
>> 
>> Hmm.  That problem does sound familiar.
>> 
>> Is the problem oversharing?  Is the problem that the mount of /proc in
>> the chroot directory is propogating into other mount namespaces that
>> don't care?
>
> /proc is propagating into another mount namespaces that does not care.
> This happens because systemd creates for several services a mount namespace and sets
> the root tree to MS_SHARED.

Unless you are in the primary mount namespace when having this issue
it would be good to run mount make-rprivate / before mounting your new
/proc.

>> If the problem is over propogating I would argue that KIWI needs to
>> use a mount namespace instead of a chroot and to tweak the mount
>> namespace so the mount of /proc simply does not leak out.
>> 
>> Not that the kernel doesn't need to be fixed but that a design where
>> mounts propogate everywhere is a problem in userspace anyway, and it is
>> likely going to only need to be a handful of lines of code to fix
>> userspace cleanly.  While the kernel fix or fixes are going to require a
>> bit more time.
>
> KIWI can bypass the issue by not using a lazy unmount of /proc.
> But I fear more applications will need fixing.
> While I don't think that it was a wise choice from systemd developers to set
> / shared by default I agree that systemd is not the root cause of the problem.
> It is the messenger.
>
> It is just annoying that applications stopped working correctly after upgrading
> to systemd.

Quite.

The deep issue here is actually the design of how mount propogation
works.  If you want the option of mounts to propogate then you are
required to make them shared to start with then after creating a mount
namespace if you don't want that sharing you have to do
"mount --make-rprivate /".   Essentially all of this is documented
in Documentation/shared-subtress.txt.  Although frankly I have yet to
find a case where I really want shared subtreess.  Though I have found
a few cases where because we have to deal with sharing I using sharing
to make it not a pain.

Ultimately this means is that in every mount namespace that is created
we have to turn off sharing if we don't want it.    Which usually we
don't because mount namespaces are used when complete sharing is
undesirable.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-08-04 23:03 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-30 13:59 [PATCH] vfs: Fix RCU usage in __propagate_umount() Richard Weinberger
2014-07-30 14:20 ` Richard Weinberger
2014-07-30 18:39 ` Paul E. McKenney
2014-07-30 18:39   ` Paul E. McKenney
2014-07-30 20:46 ` MNT_DETACH and mount namespace issue (was: Re: [PATCH] vfs: Fix RCU usage in __propagate_umount()) Richard Weinberger
2014-07-31 22:17   ` MNT_DETACH and mount namespace issue Richard Weinberger
2014-08-01 15:44     ` Ram Pai
2014-08-01 19:20       ` Richard Weinberger
2014-08-01 19:20         ` Richard Weinberger
2014-08-01 22:09         ` Eric W. Biederman
2014-08-04  8:40           ` Richard Weinberger
2014-08-04 16:46             ` Eric W. Biederman
2014-08-04 21:19               ` Richard Weinberger
2014-08-04 22:10                 ` Ram Pai
2014-08-04 22:16                   ` Richard Weinberger
2014-08-04 22:59                 ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.