linux-man.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: "Michael Kerrisk \(man-pages\)" <mtk.manpages@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>,
	linux-man <linux-man@vger.kernel.org>,
	Containers <containers@lists.linux-foundation.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Jordan Ogas <jogas@lanl.gov>,
	werner@almesberger.net, Al Viro <viro@ftp.linux.org.uk>
Subject: Re: pivot_root(".", ".") and the fchdir() dance
Date: Mon, 07 Oct 2019 10:46:31 -0500	[thread overview]
Message-ID: <87k19geey0.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <7e4b23df-ab83-3d5a-3dc5-54025e3682cf@gmail.com> (Michael Kerrisk's message of "Mon, 7 Oct 2019 14:02:12 +0300")

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Eric,
>
> On 9/30/19 2:42 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>>
>>> Hello Eric,
>>>
>>> A ping on my question below. Could you take a look please?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>>>>> The concern from our conversation at the container mini-summit was that
>>>>>>> there is a pathology if in your initial mount namespace all of the
>>>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>>>> if you are going to use mount propagation), that if new_root itself
>>>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>>>
>>>>>>> So I believe the desired sequence is:
>>>>>>>
>>>>>>>>>>             chdir(new_root);
>>>>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>>>             pivot_root(".", ".");
>>>>>>>>>>             umount2(".", MNT_DETACH);
>>>>>>>
>>>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>>>> parent mount namespace.
>>>>>>
>>>>>> Thanks. I made that change.
>>>>>
>>>>> For what it is worth.  The sequence above without the change in mount
>>>>> attributes will fail if it is necessary to change the mount attributes
>>>>> as "." is both put_old as well as new_root.
>>>>>
>>>>> When I initially suggested the change I saw "." was new_root and forgot
>>>>> "." was also put_old.  So I thought there was a silent danger without
>>>>> that sequence.
>>>>
>>>> So, now I am a little confused by the comments you added here. Do you
>>>> now mean that the
>>>>
>>>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>
>>>> call is not actually necessary?
>>
>> Apologies for being slow getting back to you.
>
> NP. Thanks for your reply.
>
>> To my knowledge there are two cases where pivot_root is used.
>> - In the initial mount namespace from a ramdisk when mounting root.
>>    This is the original use case and somewhat historical as rootfs
>>    (aka an initial ramfs) may not be unmounted.
>>
>> - When setting up a new mount namespace to jettison all of the mounts
>>    you don't need.
>>
>> The sequence:
>>
>> 	chdir(new_root);
>>          pivot_root(".", ".");
>>          umount2(".", MNT_DETACH);
>>
>> is perfect for both use cases (as nothing needs to be known about the
>> directory layout of the new root filesystem).
>>
>> In the case when you are setting up a new mount namespace propogating
>> changes in the mount layout to another mount namespace is fatal.  But
>> that is not a concern for using that pivot_root sequence above because
>> pivot_root will fail deterministically if
>> 'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.
>>
>> So I would document the above sequence of three system calls in the
>> man-page.
>
> Okay. I've changed the example to be just those three calls.
>
>> I would document that pivot_root will fail if propagation would occur.
>
> Yep. That's in the page already.
>
>> I would document in pivot_root or under unshare(CLONE_NEWNS) that if
>> mount propagation is enabled (the default with systemd) that you
>> need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
>> 'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
>> namespace.  Or mounts will propagate backwards, which is usually
>> not what people want.
>
> Thanks. Instead, I have added the following text to
> mount_namespaces(7), the page that is referred to by both clone(2) and
> unshare(2) in their discussions of CLONE_NEWNS:
>
>        An   application  that  creates  a  new  mount  namespace
>        directly using clone(2) or unshare(2) may desire to  pre‐
>        vent  propagation  of  mount events to other mount names‐
>        paces (as is is done by unshare(1)).  This can be done by
>        changing  the propagation type of mount points in the new
>        namesapace to either MS_SLAVE  or  MS_PRIVATE.   using  a
>        call such as the following:
>
>                   mount(NULL, "/", MS_SLAVE | MS_REC, NULL);

Yes.

>> Creating of a mount namespace in a user namespace automatically does
>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>> namespace was not created in that user namespace.  AKA creating
>> a mount namespace in a user namespace does the unshare for you.
>
> Oh -- I had forgotten that detail. But it is documented
> (by you, I think) in mount_namespaces(7):
>
>        *  A  mount  namespace  has  an  owner user namespace.  A
>           mount namespace whose owner user namespace is  differ‐
>           ent  from the owner user namespace of its parent mount
>           namespace is considered a less privileged mount names‐
>           pace.
>
>        *  When  creating  a  less  privileged  mount  namespace,
>           shared mounts are reduced to  slave  mounts.   (Shared
>           and  slave  mounts are discussed below.)  This ensures
>           that  mappings  performed  in  less  privileged  mount
>           namespaces will not propagate to more privileged mount
>           namespaces.
>
> There's one point that description that troubles me. There is a
> reference to "parent mount namespace", but as I understand things
> there is no parental relationship among mount namespaces instances
> (or am I wrong?). Should that wording not be rather something
> like "the mount namespace of the process that created this mount
> namespace"?

How about "the mount namespace this mount namespace started as a copy of"

You are absolutely correct there is no relationship between mount
namespaces.  There is just the propagation tree between mounts.  (Which
acts similarly to a parent/child relationship but is not at all the same
thing).

Eric


  reply	other threads:[~2019-10-07 15:47 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-01 13:38 pivot_root(".", ".") and the fchdir() dance Michael Kerrisk (man-pages)
2019-08-05 10:36 ` Aleksa Sarai
2019-08-05 12:29   ` Michael Kerrisk (man-pages)
2019-08-05 13:37     ` Aleksa Sarai
2019-08-06 19:35       ` Michael Kerrisk (man-pages)
2019-08-06  8:12     ` Philipp Wendler
2019-08-06 12:03       ` Michael Kerrisk (man-pages)
2019-09-09 10:40         ` Eric W. Biederman
2019-09-09 14:48           ` Michael Kerrisk (man-pages)
2019-09-09 23:40             ` Eric W. Biederman
2019-09-10 10:27               ` Michael Kerrisk (man-pages)
2019-09-10 11:15                 ` Christian Brauner
2019-09-10 11:21                   ` Michael Kerrisk (man-pages)
2019-09-10 23:06                     ` Eric W. Biederman
2019-09-15  8:12                       ` Michael Kerrisk (man-pages)
2019-09-15 18:17                         ` Eric W. Biederman
2019-09-23 11:10                           ` Michael Kerrisk (man-pages)
2019-09-28 15:05                             ` Michael Kerrisk (man-pages)
2019-09-30 11:42                               ` Eric W. Biederman
2019-10-07 11:02                                 ` Michael Kerrisk (man-pages)
2019-10-07 15:46                                   ` Eric W. Biederman [this message]
2019-10-08 14:27                                     ` Michael Kerrisk (man-pages)
2019-10-08 19:40                                       ` Eric W. Biederman
2019-10-08 21:40                                         ` Michael Kerrisk (man-pages)
2019-10-08 22:16                                           ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k19geey0.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=jogas@lanl.gov \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mtk.manpages@gmail.com \
    --cc=viro@ftp.linux.org.uk \
    --cc=werner@almesberger.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).