linux-man.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* pivot_root(".", ".") and the fchdir() dance
@ 2019-08-01 13:38 Michael Kerrisk (man-pages)
  2019-08-05 10:36 ` Aleksa Sarai
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-08-01 13:38 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Containers, Stéphane Graber,
	Christian Brauner, Al Viro, lkml, linux-man, Jordan Ogas

Hi Serge, Andy, et al,

I've been looking at doing some updates for the rather inaccurate
pivot_root(2) manual page, and I noticed this 2014 commit in LXC

[[commit 2d489f9e87fa0cccd8a1762680a43eeff2fe1b6e
Author: Serge Hallyn <serge.hallyn@ubuntu.com>
Date:   Sat Sep 20 03:15:44 2014 +0000

    pivot_root: switch to a new mechanism (v2)

    This idea came from Andy Lutomirski.  Instead of using a
    temporary directory for the pivot_root put-old, use "." both
    for new-root and old-root.  Then fchdir into the old root
    temporarily in order to unmount the old-root, and finally
    chdir back into our '/'.
]]

I'd like to add some documentation about the pivot_root(".", ".")
idea, but I have a doubt/question. In the lxc_pivot_root() code we
have these steps

        oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
        newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);

        fchdir(newroot);
        pivot_root(".", ".");

        fchdir(oldroot);      // ****

        mount("", ".", "", MS_SLAVE | MS_REC, NULL);
        umount2(".", MNT_DETACH);

        fchdir(newroot);      // ****

My question: are the two fchdir() calls marked "****" really
necessary? I suspect not. My reasoning:
1. By this point, both the CWD and root dir of the calling process are
in newroot (and so do not keep newroot busy, and thus don't prevent
the unmount).
2. After the pivot_root() operation, there are two mount points
stacked at "/": oldroot and newroot, with oldroot a child mount
stacked on top of newroot (I did some experiments to verify that this
is so, by examination of /proc/self/mountinfo).
3. The umount(".") operation unmounts the topmost mount from the pair
of mounts stacked at "/".

At least, in some separate tests that I've done, things seem to work
as I describe above without the use of the marked fchdir() calls. (My
tests omit the mount(MS_SLAVE) piece, since in my tests I do a
more-or-less equivalent step at an earlier point.

Am I missing something?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-01 13:38 pivot_root(".", ".") and the fchdir() dance Michael Kerrisk (man-pages)
@ 2019-08-05 10:36 ` Aleksa Sarai
  2019-08-05 12:29   ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Aleksa Sarai @ 2019-08-05 10:36 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Serge E. Hallyn, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, Al Viro

[-- Attachment #1: Type: text/plain, Size: 1704 bytes --]

On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> I'd like to add some documentation about the pivot_root(".", ".")
> idea, but I have a doubt/question. In the lxc_pivot_root() code we
> have these steps
> 
>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> 
>         fchdir(newroot);
>         pivot_root(".", ".");
> 
>         fchdir(oldroot);      // ****

This one is "required" because (as the pivot_root(2) man page states),
it's technically not guaranteed by the kernel that the process's cwd
will be the same after pivot_root(2):

> pivot_root() may or may not change the current root and the current
> working directory of any processes or threads which use the old root
> directory.

Now, if it turns out that we can rely on the current behaviour (and the
man page you're improving is actually inaccurate on this point) then
you're right that this fchdir(2) isn't required.

>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
>         umount2(".", MNT_DETACH);

>         fchdir(newroot);      // ****

And this one is required because we are in @oldroot at this point, due
to the first fchdir(2). If we don't have the first one, then switching
from "." to "/" in the mount/umount2 calls should fix the issue.

We do something very similar to this in runc as well[1] (though, as the
commit message says, I "borrowed" the idea from LXC).

[1]: https://github.com/opencontainers/runc/commit/f8e6b5af5e120ab7599885bd13a932d970ccc748

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-05 10:36 ` Aleksa Sarai
@ 2019-08-05 12:29   ` Michael Kerrisk (man-pages)
  2019-08-05 13:37     ` Aleksa Sarai
  2019-08-06  8:12     ` Philipp Wendler
  0 siblings, 2 replies; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-08-05 12:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: mtk.manpages, Serge E. Hallyn, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, Al Viro, werner

[CC += Werner Almesberger, original author of both the system call 
and the manual page.]

Hello Aleksa,

Thank you for your responses. See below.

On 8/5/19 12:36 PM, Aleksa Sarai wrote:
> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> I'd like to add some documentation about the pivot_root(".", ".")
>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
>> have these steps
>>
>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>
>>         fchdir(newroot);
>>         pivot_root(".", ".");
>>
>>         fchdir(oldroot);      // ****
> 
> This one is "required" because (as the pivot_root(2) man page states),
> it's technically not guaranteed by the kernel that the process's cwd
> will be the same after pivot_root(2):
> 
>> pivot_root() may or may not change the current root and the current
>> working directory of any processes or threads which use the old root
>> directory.
>
> 
> Now, if it turns out that we can rely on the current behaviour (and the
> man page you're improving is actually inaccurate on this point) then
> you're right that this fchdir(2) isn't required.

I'm not sure that I follow your logic here. In the old manual page
text that you quote above, it says that [pivot_root() may change the
CWD of any processes that use the old root directory]. Two points 
there:

(1) the first fchdir() has *already* changed the CWD of the calling
process to the new root directory, and 

(2) the manual page implied but did not explicitly say that the
CWD of processes using the old root may be changed *to the new root
directory* (rather than changed to some arbitrary location!);
presumably, omitting to mention that detail explicitly in the manual
page was an oversight, since that is indeed the kernel's behavior.

The point is, the manual page was written 19 years ago and has
barely been changed since then. Even at the time that the system
call was officially released (in Linux 2.4.0), the manual page was
already inaccurate in a number of details, since it was written 
about a year beforehand (during the 2.3 series), and the 
implementation already changed by the time of 2.4.0, but the manual
page was not changed then (or since, but I'm working on that).

The behavior has in practice always been (modulo the introduction
of mount namespaces in 2001/2002):

       pivot_root()  changes  the root directory and the current
       working directory of each process or thread in  the  same
       mount namespace to new_root if they point to the old root
       directory.

Given that this has been the behavior since Linux 2.4.0 was
released, it improbable that this will ever change, since,
notwithstanding what the manual page says, this would be an ABI
breakage.

I hypothesize that the original manual page text, written before
the system call was even officially released, reflects Werner's
belief at the time that perhaps in the not too distant future
the implementation might change. But, 18 years on from 2.4.0,
it has not.

And arguably, the manual page should reflect that reality,
describing what the kernel actually does, rather than speculating
that things might (after 19 years) still sometime change.

>>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
>>         umount2(".", MNT_DETACH);
> 
>>         fchdir(newroot);      // ****
> 
> And this one is required because we are in @oldroot at this point, due
> to the first fchdir(2). If we don't have the first one, then switching
> from "." to "/" in the mount/umount2 calls should fix the issue.

See my notes above for why I therefore think that the second fchdir()
is also not needed (and therefore why switching from "." to "/" in the
mount()/umount2() calls is unnecessary.

Do you agree with my analysis?

> We do something very similar to this in runc as well[1] (though, as the
> commit message says, I "borrowed" the idea from LXC).
> 
> [1]: https://github.com/opencontainers/runc/commit/f8e6b5af5e120ab7599885bd13a932d970ccc748

Yep -- I saw your code as well, which in fact was what led me back
to the source of the idea in LXC (so, thanks for commenting the
origin of the idea).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-05 12:29   ` Michael Kerrisk (man-pages)
@ 2019-08-05 13:37     ` Aleksa Sarai
  2019-08-06 19:35       ` Michael Kerrisk (man-pages)
  2019-08-06  8:12     ` Philipp Wendler
  1 sibling, 1 reply; 25+ messages in thread
From: Aleksa Sarai @ 2019-08-05 13:37 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Serge E. Hallyn, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, Al Viro, werner, Aleksa Sarai

[-- Attachment #1: Type: text/plain, Size: 6251 bytes --]

On 2019-08-05, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On 8/5/19 12:36 PM, Aleksa Sarai wrote:
> > On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >> I'd like to add some documentation about the pivot_root(".", ".")
> >> idea, but I have a doubt/question. In the lxc_pivot_root() code we
> >> have these steps
> >>
> >>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> >>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> >>
> >>         fchdir(newroot);
> >>         pivot_root(".", ".");
> >>
> >>         fchdir(oldroot);      // ****
> > 
> > This one is "required" because (as the pivot_root(2) man page states),
> > it's technically not guaranteed by the kernel that the process's cwd
> > will be the same after pivot_root(2):
> > 
> >> pivot_root() may or may not change the current root and the current
> >> working directory of any processes or threads which use the old root
> >> directory.
> > 
> > Now, if it turns out that we can rely on the current behaviour (and the
> > man page you're improving is actually inaccurate on this point) then
> > you're right that this fchdir(2) isn't required.
> 
> I'm not sure that I follow your logic here. In the old manual page
> text that you quote above, it says that [pivot_root() may change the
> CWD of any processes that use the old root directory]. Two points 
> there:
> 
> (1) the first fchdir() has *already* changed the CWD of the calling
> process to the new root directory, and

Right, I (and presumably the LXC folks as well) must've missed the
qualifier on the end of the sentence and was thinking that it said "you
can't trust CWD after pivot_root(2)".

My follow-up was going to be that we need to be in the old root to
umount, but as you mentioned that shouldn't be necessary either since
the umount will apply to the stacked mount (which is somewhat
counter-intuitively the earlier mount not the later one -- I will freely
admit that I don't understand all of the stacked and tucked mount
logic in VFS).

> (2) the manual page implied but did not explicitly say that the
> CWD of processes using the old root may be changed *to the new root
> directory* (rather than changed to some arbitrary location!);
> presumably, omitting to mention that detail explicitly in the manual
> page was an oversight, since that is indeed the kernel's behavior.
> 
> The point is, the manual page was written 19 years ago and has
> barely been changed since then. Even at the time that the system
> call was officially released (in Linux 2.4.0), the manual page was
> already inaccurate in a number of details, since it was written 
> about a year beforehand (during the 2.3 series), and the 
> implementation already changed by the time of 2.4.0, but the manual
> page was not changed then (or since, but I'm working on that).
> 
> The behavior has in practice always been (modulo the introduction
> of mount namespaces in 2001/2002):
> 
>        pivot_root()  changes  the root directory and the current
>        working directory of each process or thread in  the  same
>        mount namespace to new_root if they point to the old root
>        directory.
> 
> Given that this has been the behavior since Linux 2.4.0 was
> released, it improbable that this will ever change, since,
> notwithstanding what the manual page says, this would be an ABI
> breakage.
> 
> I hypothesize that the original manual page text, written before
> the system call was even officially released, reflects Werner's
> belief at the time that perhaps in the not too distant future
> the implementation might change. But, 18 years on from 2.4.0,
> it has not.
> 
> And arguably, the manual page should reflect that reality,
> describing what the kernel actually does, rather than speculating
> that things might (after 19 years) still sometime change.

I wasn't aware of the history of the man page, and took it as gospel
that we should avoid making assumptions about current's CWD surrounding
a pivot_root(2). Given the history (and that it appears the behaviour
was never intended to be changed after being merged), we should
definitely drop that text to avoid the confusion which has already
caused us container folks to implement this in a
more-convoluted-than-necessary fashion.

In case you haven't noticed already, you might want to also send a patch
to util-linux to also update pivot_root(8) which makes the same mistake
in its text:

> Note that, depending on the implementation of pivot_root, root and cwd
> of the caller may or may not change.

Then again, it's also possible this text is independently just as vague
for other reasons.

> > And this one is required because we are in @oldroot at this point, due
> > to the first fchdir(2). If we don't have the first one, then switching
> > from "." to "/" in the mount/umount2 calls should fix the issue.
> 
> See my notes above for why I therefore think that the second fchdir()
> is also not needed (and therefore why switching from "." to "/" in the
> mount()/umount2() calls is unnecessary.

My gut feeling reading this was that operating on "." will result in you
doing the later mount operations on @newroot (since "." is @newroot) not
on the stacked mount which isn't your CWD.

*But* my gut feeling is obviously wrong (since you've tested it), and I
will again admit I don't understand quite how CWD references interact
with mount operations -- especially in the context of stacked mounts.

> Do you agree with my analysis?

Minus the mount bits that I'm not too sure about (I defer to
Christian/Serge/et al on that point), it seems reasonable to me.

My only real argument for keeping it the way it is, is that it's much
easier (for me, at least) to understand the semantics with explicit
fchdir(2)s. But that's not really a good reason to continue doing it the
way we do it now -- if it's documented in a man page that'd be more than
sufficient to avoid confusion when reading snippets that do it without
the fchdir(2)s.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-05 12:29   ` Michael Kerrisk (man-pages)
  2019-08-05 13:37     ` Aleksa Sarai
@ 2019-08-06  8:12     ` Philipp Wendler
  2019-08-06 12:03       ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 25+ messages in thread
From: Philipp Wendler @ 2019-08-06  8:12 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Aleksa Sarai
  Cc: linux-man, Containers, lkml, Andy Lutomirski, Jordan Ogas,
	werner, Al Viro

Hello Michael, hello Aleksa,

Am 05.08.19 um 14:29 schrieb Michael Kerrisk (man-pages):

> On 8/5/19 12:36 PM, Aleksa Sarai wrote:
>> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>>> I'd like to add some documentation about the pivot_root(".", ".")
>>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
>>> have these steps
>>>
>>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>>
>>>         fchdir(newroot);
>>>         pivot_root(".", ".");
>>>
>>>         fchdir(oldroot);      // ****
>>>
>>>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
>>>         umount2(".", MNT_DETACH);
>>
>>>         fchdir(newroot);      // ****
>>
>> And this one is required because we are in @oldroot at this point, due
>> to the first fchdir(2). If we don't have the first one, then switching
>> from "." to "/" in the mount/umount2 calls should fix the issue.
> 
> See my notes above for why I therefore think that the second fchdir()
> is also not needed (and therefore why switching from "." to "/" in the
> mount()/umount2() calls is unnecessary.
> 
> Do you agree with my analysis?

If both the second and third fchdir are not required,
then we do not need to bother with file descriptors at all, right?

Indeed, my tests show that the following seems to work fine:

chdir(rootfs)
pivot_root(".", ".")
umount2(".", MNT_DETACH)

I tested that with my own tool[1] that uses user namespaces and marks
everything MS_PRIVATE before, so I do not need the mount(MS_SLAVE) here.

And it works the same with both umount2("/") and umount2(".").

Did I overlook something that makes the file descriptors required?
If not, wouldn't the above snippet make sense as example in the man page?

Greetings
Philipp

[1]: https://github.com/sosy-lab/benchexec/blob/b90aeb034b867711845a453587b73fbe8e4dca68/benchexec/container.py#L735

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-06  8:12     ` Philipp Wendler
@ 2019-08-06 12:03       ` Michael Kerrisk (man-pages)
  2019-09-09 10:40         ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-08-06 12:03 UTC (permalink / raw)
  To: Philipp Wendler
  Cc: Aleksa Sarai, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

Hello Philipp,

On Tue, 6 Aug 2019 at 10:12, Philipp Wendler <ml@philippwendler.de> wrote:
>
> Hello Michael, hello Aleksa,
>
> Am 05.08.19 um 14:29 schrieb Michael Kerrisk (man-pages):
>
> > On 8/5/19 12:36 PM, Aleksa Sarai wrote:
> >> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >>> I'd like to add some documentation about the pivot_root(".", ".")
> >>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
> >>> have these steps
> >>>
> >>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> >>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> >>>
> >>>         fchdir(newroot);
> >>>         pivot_root(".", ".");
> >>>
> >>>         fchdir(oldroot);      // ****
> >>>
> >>>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
> >>>         umount2(".", MNT_DETACH);
> >>
> >>>         fchdir(newroot);      // ****
> >>
> >> And this one is required because we are in @oldroot at this point, due
> >> to the first fchdir(2). If we don't have the first one, then switching
> >> from "." to "/" in the mount/umount2 calls should fix the issue.
> >
> > See my notes above for why I therefore think that the second fchdir()
> > is also not needed (and therefore why switching from "." to "/" in the
> > mount()/umount2() calls is unnecessary.
> >
> > Do you agree with my analysis?
>
> If both the second and third fchdir are not required,
> then we do not need to bother with file descriptors at all, right?

Exactly.

> Indeed, my tests show that the following seems to work fine:
>
> chdir(rootfs)
> pivot_root(".", ".")
> umount2(".", MNT_DETACH)

Thanks for the confirmation, That's also exactly what I tested.

> I tested that with my own tool[1] that uses user namespaces and marks
> everything MS_PRIVATE before, so I do not need the mount(MS_SLAVE) here.
>
> And it works the same with both umount2("/") and umount2(".").

Yes.

> Did I overlook something that makes the file descriptors required?

No.

> If not, wouldn't the above snippet make sense as example in the man page?

I have exactly that snippet in a pending change for the manual page :-).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-05 13:37     ` Aleksa Sarai
@ 2019-08-06 19:35       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-08-06 19:35 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: mtk.manpages, Serge E. Hallyn, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, Al Viro, werner, Aleksa Sarai

Hello Aleksa,

On 8/5/19 3:37 PM, Aleksa Sarai wrote:
> On 2019-08-05, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> On 8/5/19 12:36 PM, Aleksa Sarai wrote:
>>> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>>>> I'd like to add some documentation about the pivot_root(".", ".")
>>>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
>>>> have these steps
>>>>
>>>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>>>
>>>>         fchdir(newroot);
>>>>         pivot_root(".", ".");
>>>>
>>>>         fchdir(oldroot);      // ****
>>>
>>> This one is "required" because (as the pivot_root(2) man page states),
>>> it's technically not guaranteed by the kernel that the process's cwd
>>> will be the same after pivot_root(2):
>>>
>>>> pivot_root() may or may not change the current root and the current
>>>> working directory of any processes or threads which use the old root
>>>> directory.
>>>
>>> Now, if it turns out that we can rely on the current behaviour (and the
>>> man page you're improving is actually inaccurate on this point) then
>>> you're right that this fchdir(2) isn't required.
>>
>> I'm not sure that I follow your logic here. In the old manual page
>> text that you quote above, it says that [pivot_root() may change the
>> CWD of any processes that use the old root directory]. Two points 
>> there:
>>
>> (1) the first fchdir() has *already* changed the CWD of the calling
>> process to the new root directory, and
> 
> Right, I (and presumably the LXC folks as well) must've missed the
> qualifier on the end of the sentence and was thinking that it said "you
> can't trust CWD after pivot_root(2)".
> 
> My follow-up was going to be that we need to be in the old root to
> umount, but as you mentioned that shouldn't be necessary either since
> the umount will apply to the stacked mount (which is somewhat
> counter-intuitively the earlier mount not the later one -- I will freely
> admit that I don't understand all of the stacked and tucked mount
> logic in VFS).

Your not alone. I don't follow that code easily either. But, looking
at the order that the mounts were stacked in /proc/PID/mountinfo
helped clarify things for me.

>> (2) the manual page implied but did not explicitly say that the
>> CWD of processes using the old root may be changed *to the new root
>> directory* (rather than changed to some arbitrary location!);
>> presumably, omitting to mention that detail explicitly in the manual
>> page was an oversight, since that is indeed the kernel's behavior.
>>
>> The point is, the manual page was written 19 years ago and has
>> barely been changed since then. Even at the time that the system
>> call was officially released (in Linux 2.4.0), the manual page was
>> already inaccurate in a number of details, since it was written 
>> about a year beforehand (during the 2.3 series), and the 
>> implementation already changed by the time of 2.4.0, but the manual
>> page was not changed then (or since, but I'm working on that).
>>
>> The behavior has in practice always been (modulo the introduction
>> of mount namespaces in 2001/2002):
>>
>>        pivot_root()  changes  the root directory and the current
>>        working directory of each process or thread in  the  same
>>        mount namespace to new_root if they point to the old root
>>        directory.
>>
>> Given that this has been the behavior since Linux 2.4.0 was
>> released, it improbable that this will ever change, since,
>> notwithstanding what the manual page says, this would be an ABI
>> breakage.
>>
>> I hypothesize that the original manual page text, written before
>> the system call was even officially released, reflects Werner's
>> belief at the time that perhaps in the not too distant future
>> the implementation might change. But, 18 years on from 2.4.0,
>> it has not.
>>
>> And arguably, the manual page should reflect that reality,
>> describing what the kernel actually does, rather than speculating
>> that things might (after 19 years) still sometime change.
> 
> I wasn't aware of the history of the man page, and took it as gospel
> that we should avoid making assumptions about current's CWD surrounding
> a pivot_root(2). Given the history (and that it appears the behaviour
> was never intended to be changed after being merged), we should
> definitely drop that text to avoid the confusion which has already
> caused us container folks to implement this in a
> more-convoluted-than-necessary fashion.
> 
> In case you haven't noticed already, you might want to also send a patch
> to util-linux to also update pivot_root(8) which makes the same mistake
> in its text:
> 
>> Note that, depending on the implementation of pivot_root, root and cwd
>> of the caller may or may not change.
>
> Then again, it's also possible this text is independently just as vague
> for other reasons.

I think that page was also written by Werner, back in the day. So I
think it's vague for the same reasons.
 
>>> And this one is required because we are in @oldroot at this point, due
>>> to the first fchdir(2). If we don't have the first one, then switching
>>> from "." to "/" in the mount/umount2 calls should fix the issue.
>>
>> See my notes above for why I therefore think that the second fchdir()
>> is also not needed (and therefore why switching from "." to "/" in the
>> mount()/umount2() calls is unnecessary.
> 
> My gut feeling reading this was that operating on "." will result in you
> doing the later mount operations on @newroot (since "." is @newroot) not
> on the stacked mount which isn't your CWD.
> 
> *But* my gut feeling is obviously wrong (since you've tested it), and I
> will again admit I don't understand quite how CWD references interact
> with mount operations -- especially in the context of stacked mounts.
> 
>> Do you agree with my analysis?
> 
> Minus the mount bits that I'm not too sure about (I defer to
> Christian/Serge/et al on that point), it seems reasonable to me.

Okay.

> My only real argument for keeping it the way it is, is that it's much
> easier (for me, at least) to understand the semantics with explicit
> fchdir(2)s. But that's not really a good reason to continue doing it the
> way we do it now -- if it's documented in ah man page that'd be more than
> sufficient to avoid confusion when reading snippets that do it without
> the fchdir(2)s.

Yes. I'm wanting to get a some feedback/confirmation from others
before I finalize the changes ti the manual page. The feedback from
you and Philipp has already been helpful. I'm hoping that Serge or
Andy might also chip in.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-08-06 12:03       ` Michael Kerrisk (man-pages)
@ 2019-09-09 10:40         ` Eric W. Biederman
  2019-09-09 14:48           ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-09-09 10:40 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Philipp Wendler, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Philipp,
>
> On Tue, 6 Aug 2019 at 10:12, Philipp Wendler <ml@philippwendler.de> wrote:
>>
>> Hello Michael, hello Aleksa,
>>
>> Am 05.08.19 um 14:29 schrieb Michael Kerrisk (man-pages):
>>
>> > On 8/5/19 12:36 PM, Aleksa Sarai wrote:
>> >> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> >>> I'd like to add some documentation about the pivot_root(".", ".")
>> >>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
>> >>> have these steps
>> >>>
>> >>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>> >>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>> >>>
>> >>>         fchdir(newroot);
>> >>>         pivot_root(".", ".");
>> >>>
>> >>>         fchdir(oldroot);      // ****
>> >>>
>> >>>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
>> >>>         umount2(".", MNT_DETACH);
>> >>
>> >>>         fchdir(newroot);      // ****
>> >>
>> >> And this one is required because we are in @oldroot at this point, due
>> >> to the first fchdir(2). If we don't have the first one, then switching
>> >> from "." to "/" in the mount/umount2 calls should fix the issue.
>> >
>> > See my notes above for why I therefore think that the second fchdir()
>> > is also not needed (and therefore why switching from "." to "/" in the
>> > mount()/umount2() calls is unnecessary.
>> >
>> > Do you agree with my analysis?
>>
>> If both the second and third fchdir are not required,
>> then we do not need to bother with file descriptors at all, right?
>
> Exactly.
>
>> Indeed, my tests show that the following seems to work fine:
>>
>> chdir(rootfs)
>> pivot_root(".", ".")
>> umount2(".", MNT_DETACH)
>
> Thanks for the confirmation, That's also exactly what I tested.
>
>> I tested that with my own tool[1] that uses user namespaces and marks
>> everything MS_PRIVATE before, so I do not need the mount(MS_SLAVE) here.
>>
>> And it works the same with both umount2("/") and umount2(".").
>
> Yes.
>
>> Did I overlook something that makes the file descriptors required?
>
> No.
>
>> If not, wouldn't the above snippet make sense as example in the man page?
>
> I have exactly that snippet in a pending change for the manual page :-).

I have just spotted this conversation and I expect if you are going
to use this example it is probably good to document what is going
on so that people can follow along.

>> chdir(rootfs)
>> pivot_root(".", ".")

At this point the mount stack should be:
old_root
new_root
rootfs

With "." and "/" pointing to new_root.

>> umount2(".", MNT_DETACH)

At this point resolving "." starts with new_root and follows up the
mount stack to old-root.

Ordinarily if you unmount "/" as is happening above you then need to
call chroot and possibly chdir to ensure neither "/" nor "." point to
somewhere other than the unmounted root filesystem.  In this specific
case because "/" and "." resolve to new_root under the filesystem that is
being unmounted that all is well.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-09 10:40         ` Eric W. Biederman
@ 2019-09-09 14:48           ` Michael Kerrisk (man-pages)
  2019-09-09 23:40             ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-09 14:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Philipp Wendler, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

Hello Eric,

Thanks for chiming in; I should have thought to CC you at the start. I
have a question or two, below.

On Mon, 9 Sep 2019 at 12:40, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>
> > Hello Philipp,
> >
> > On Tue, 6 Aug 2019 at 10:12, Philipp Wendler <ml@philippwendler.de> wrote:
> >>
> >> Hello Michael, hello Aleksa,
> >>
> >> Am 05.08.19 um 14:29 schrieb Michael Kerrisk (man-pages):
> >>
> >> > On 8/5/19 12:36 PM, Aleksa Sarai wrote:
> >> >> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >> >>> I'd like to add some documentation about the pivot_root(".", ".")
> >> >>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
> >> >>> have these steps
> >> >>>
> >> >>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> >> >>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> >> >>>
> >> >>>         fchdir(newroot);
> >> >>>         pivot_root(".", ".");
> >> >>>
> >> >>>         fchdir(oldroot);      // ****
> >> >>>
> >> >>>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
> >> >>>         umount2(".", MNT_DETACH);
> >> >>
> >> >>>         fchdir(newroot);      // ****
> >> >>
> >> >> And this one is required because we are in @oldroot at this point, due
> >> >> to the first fchdir(2). If we don't have the first one, then switching
> >> >> from "." to "/" in the mount/umount2 calls should fix the issue.
> >> >
> >> > See my notes above for why I therefore think that the second fchdir()
> >> > is also not needed (and therefore why switching from "." to "/" in the
> >> > mount()/umount2() calls is unnecessary.
> >> >
> >> > Do you agree with my analysis?
> >>
> >> If both the second and third fchdir are not required,
> >> then we do not need to bother with file descriptors at all, right?
> >
> > Exactly.
> >
> >> Indeed, my tests show that the following seems to work fine:
> >>
> >> chdir(rootfs)
> >> pivot_root(".", ".")
> >> umount2(".", MNT_DETACH)
> >
> > Thanks for the confirmation, That's also exactly what I tested.
> >
> >> I tested that with my own tool[1] that uses user namespaces and marks
> >> everything MS_PRIVATE before, so I do not need the mount(MS_SLAVE) here.
> >>
> >> And it works the same with both umount2("/") and umount2(".").
> >
> > Yes.
> >
> >> Did I overlook something that makes the file descriptors required?
> >
> > No.
> >
> >> If not, wouldn't the above snippet make sense as example in the man page?
> >
> > I have exactly that snippet in a pending change for the manual page :-).
>
> I have just spotted this conversation and I expect if you are going
> to use this example it is probably good to document what is going
> on so that people can follow along.

(Sounds reasonable.)

> >> chdir(rootfs)
> >> pivot_root(".", ".")
>
> At this point the mount stack should be:
> old_root
> new_root
> rootfs

In this context, what is 'rootfs'? The initramfs? At least, when I
examine /proc/PID/mountinfo. When I look at the / mount point in
/proc/PID/mountinfo, I see just

   old_root
   new_root

But nothing below 'new_root'. So, I'm a little puzzled.

By the way, why is 'old_root' stacked above 'new_root', do you know? I
mean, in this scenario it turns out to be useful, but it's kind of the
opposite from what I would have expected. (And if this was a
deliverate design decision in pivot_root(), it was never made
explicit.)

> With "." and "/" pointing to new_root.
>
> >> umount2(".", MNT_DETACH)
>
> At this point resolving "." starts with new_root and follows up the
> mount stack to old-root.

Okay.

> Ordinarily if you unmount "/" as is happening above you then need to
> call chroot and possibly chdir to ensure neither "/" nor "." point to
> somewhere other than the unmounted root filesystem.  In this specific
> case because "/" and "." resolve to new_root under the filesystem that is
> being unmounted that all is well.

s/that/then/ ?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-09 14:48           ` Michael Kerrisk (man-pages)
@ 2019-09-09 23:40             ` Eric W. Biederman
  2019-09-10 10:27               ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-09-09 23:40 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Philipp Wendler, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Eric,
>
> Thanks for chiming in; I should have thought to CC you at the start. I
> have a question or two, below.
>
> On Mon, 9 Sep 2019 at 12:40, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>>
>> > Hello Philipp,
>> >
>> > On Tue, 6 Aug 2019 at 10:12, Philipp Wendler <ml@philippwendler.de> wrote:
>> >>
>> >> Hello Michael, hello Aleksa,
>> >>
>> >> Am 05.08.19 um 14:29 schrieb Michael Kerrisk (man-pages):
>> >>
>> >> > On 8/5/19 12:36 PM, Aleksa Sarai wrote:
>> >> >> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> >> >>> I'd like to add some documentation about the pivot_root(".", ".")
>> >> >>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
>> >> >>> have these steps
>> >> >>>
>> >> >>>         oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>> >> >>>         newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>> >> >>>
>> >> >>>         fchdir(newroot);
>> >> >>>         pivot_root(".", ".");
>> >> >>>
>> >> >>>         fchdir(oldroot);      // ****
>> >> >>>
>> >> >>>         mount("", ".", "", MS_SLAVE | MS_REC, NULL);
>> >> >>>         umount2(".", MNT_DETACH);
>> >> >>
>> >> >>>         fchdir(newroot);      // ****
>> >> >>
>> >> >> And this one is required because we are in @oldroot at this point, due
>> >> >> to the first fchdir(2). If we don't have the first one, then switching
>> >> >> from "." to "/" in the mount/umount2 calls should fix the issue.
>> >> >
>> >> > See my notes above for why I therefore think that the second fchdir()
>> >> > is also not needed (and therefore why switching from "." to "/" in the
>> >> > mount()/umount2() calls is unnecessary.
>> >> >
>> >> > Do you agree with my analysis?
>> >>
>> >> If both the second and third fchdir are not required,
>> >> then we do not need to bother with file descriptors at all, right?
>> >
>> > Exactly.
>> >
>> >> Indeed, my tests show that the following seems to work fine:
>> >>
>> >> chdir(rootfs)
>> >> pivot_root(".", ".")
>> >> umount2(".", MNT_DETACH)
>> >
>> > Thanks for the confirmation, That's also exactly what I tested.
>> >
>> >> I tested that with my own tool[1] that uses user namespaces and marks
>> >> everything MS_PRIVATE before, so I do not need the mount(MS_SLAVE) here.
>> >>
>> >> And it works the same with both umount2("/") and umount2(".").
>> >
>> > Yes.
>> >
>> >> Did I overlook something that makes the file descriptors required?
>> >
>> > No.
>> >
>> >> If not, wouldn't the above snippet make sense as example in the man page?
>> >
>> > I have exactly that snippet in a pending change for the manual page :-).
>>
>> I have just spotted this conversation and I expect if you are going
>> to use this example it is probably good to document what is going
>> on so that people can follow along.
>
> (Sounds reasonable.)
>
>> >> chdir(rootfs)
>> >> pivot_root(".", ".")
>>
>> At this point the mount stack should be:
>> old_root
>> new_root
>> rootfs
>
> In this context, what is 'rootfs'? The initramfs? At least, when I
> examine /proc/PID/mountinfo. When I look at the / mount point in
> /proc/PID/mountinfo, I see just
>
>    old_root
>    new_root
>
> But nothing below 'new_root'. So, I'm a little puzzled.

I think that is because Al changed /proc/mounts to not display mounts
that are outside of your current root.  But yes there is typically
the initramfs of file system type rootfs on their.  Even when it isn't
used you have one.  Just to keep everything simple I presume.

I haven't double checked lately to be certain it is there but I expect
it is.

> By the way, why is 'old_root' stacked above 'new_root', do you know? I
> mean, in this scenario it turns out to be useful, but it's kind of the
> opposite from what I would have expected. (And if this was a
> deliverate design decision in pivot_root(), it was never made
> explicit.)

Oh.  It is absolutely explicit and part of the design and it has nothing
to do with this case.

The pivot_root system calls takes two parameters:  new_root and put_old.

In this case the old root is put on put_old (which is the new_root).
And new_root is made the current root.

The pivot_root code looks everything up before it moves anything.   With
the result it is totally immaterrial which order the moves actually
happen in the code.  Further it is pretty much necessary to look
everything up before things are moved because the definition of paths
change.

So it would actually be difficult to have pivot_root(.,.) to do anything
except what it does today.


>> With "." and "/" pointing to new_root.
>>
>> >> umount2(".", MNT_DETACH)
>>
>> At this point resolving "." starts with new_root and follows up the
>> mount stack to old-root.
>
> Okay.
>
>> Ordinarily if you unmount "/" as is happening above you then need to
>> call chroot and possibly chdir to ensure neither "/" nor "." point to
>> somewhere other than the unmounted root filesystem.  In this specific
>> case because "/" and "." resolve to new_root under the filesystem that is
>> being unmounted that all is well.
>
> s/that/then/ ?

Yes.

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-09 23:40             ` Eric W. Biederman
@ 2019-09-10 10:27               ` Michael Kerrisk (man-pages)
  2019-09-10 11:15                 ` Christian Brauner
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-10 10:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Philipp Wendler, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Eric,

On 9/10/19 1:40 AM, Eric W. Biederman wrote:

[...]

>>> I have just spotted this conversation and I expect if you are going
>>> to use this example it is probably good to document what is going
>>> on so that people can follow along.
>>
>> (Sounds reasonable.)
>>
>>>>> chdir(rootfs)
>>>>> pivot_root(".", ".")
>>>
>>> At this point the mount stack should be:
>>> old_root
>>> new_root
>>> rootfs
>>
>> In this context, what is 'rootfs'? The initramfs? At least, when I
>> examine /proc/PID/mountinfo. When I look at the / mount point in
>> /proc/PID/mountinfo, I see just
>>
>>    old_root
>>    new_root
>>
>> But nothing below 'new_root'. So, I'm a little puzzled.
> 
> I think that is because Al changed /proc/mounts to not display mounts
> that are outside of your current root.  But yes there is typically
> the initramfs of file system type rootfs on their.  Even when it isn't
> used you have one.  Just to keep everything simple I presume.
> 
> I haven't double checked lately to be certain it is there but I expect
> it is.
> 
>> By the way, why is 'old_root' stacked above 'new_root', do you know? I
>> mean, in this scenario it turns out to be useful, but it's kind of the
>> opposite from what I would have expected. (And if this was a
>> deliverate design decision in pivot_root(), it was never made
>> explicit.)
> 
> Oh.  It is absolutely explicit and part of the design and it has nothing
> to do with this case.
> 
> The pivot_root system calls takes two parameters:  new_root and put_old.
> 
> In this case the old root is put on put_old (which is the new_root).
> And new_root is made the current root.
> 
> The pivot_root code looks everything up before it moves anything.   With
> the result it is totally immaterrial which order the moves actually
> happen in the code.  Further it is pretty much necessary to look
> everything up before things are moved because the definition of paths
> change.
> 
> So it would actually be difficult to have pivot_root(.,.) to do anything
> except what it does today.
> 
> 
>>> With "." and "/" pointing to new_root.
>>>
>>>>> umount2(".", MNT_DETACH)
>>>
>>> At this point resolving "." starts with new_root and follows up the
>>> mount stack to old-root.
>>
>> Okay.
>>
>>> Ordinarily if you unmount "/" as is happening above you then need to
>>> call chroot and possibly chdir to ensure neither "/" nor "." point to
>>> somewhere other than the unmounted root filesystem.  In this specific
>>> case because "/" and "." resolve to new_root under the filesystem that is
>>> being unmounted that all is well.
>>
>> s/that/then/ ?

Thanks for the further clarifications.

All: I plan to add the following text to the manual page:

       new_root and put_old may be the same  directory.   In  particular,
       the following sequence allows a pivot-root operation without need‐
       ing to create and remove a temporary directory:

           chdir(new_root);
           pivot_root(".", ".");
           umount2(".", MNT_DETACH);

       This sequence succeeds because the pivot_root()  call  stacks  the
       old root mount point (old_root) on top of the new root mount point
       at /.  At that point, the calling  process's  root  directory  and
       current  working  directory  refer  to  the  new  root mount point
       (new_root).  During the subsequent umount()  call,  resolution  of
       "."   starts  with  new_root  and then moves up the list of mounts
       stacked at /, with the result that old_root is unmounted.

Look okay?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-10 10:27               ` Michael Kerrisk (man-pages)
@ 2019-09-10 11:15                 ` Christian Brauner
  2019-09-10 11:21                   ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2019-09-10 11:15 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Eric W. Biederman, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

On Tue, Sep 10, 2019 at 12:27:27PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Eric,
> 
> On 9/10/19 1:40 AM, Eric W. Biederman wrote:
> 
> [...]
> 
> >>> I have just spotted this conversation and I expect if you are going
> >>> to use this example it is probably good to document what is going
> >>> on so that people can follow along.
> >>
> >> (Sounds reasonable.)
> >>
> >>>>> chdir(rootfs)
> >>>>> pivot_root(".", ".")
> >>>
> >>> At this point the mount stack should be:
> >>> old_root
> >>> new_root
> >>> rootfs
> >>
> >> In this context, what is 'rootfs'? The initramfs? At least, when I
> >> examine /proc/PID/mountinfo. When I look at the / mount point in
> >> /proc/PID/mountinfo, I see just
> >>
> >>    old_root
> >>    new_root
> >>
> >> But nothing below 'new_root'. So, I'm a little puzzled.
> > 
> > I think that is because Al changed /proc/mounts to not display mounts
> > that are outside of your current root.  But yes there is typically
> > the initramfs of file system type rootfs on their.  Even when it isn't
> > used you have one.  Just to keep everything simple I presume.
> > 
> > I haven't double checked lately to be certain it is there but I expect
> > it is.
> > 
> >> By the way, why is 'old_root' stacked above 'new_root', do you know? I
> >> mean, in this scenario it turns out to be useful, but it's kind of the
> >> opposite from what I would have expected. (And if this was a
> >> deliverate design decision in pivot_root(), it was never made
> >> explicit.)
> > 
> > Oh.  It is absolutely explicit and part of the design and it has nothing
> > to do with this case.
> > 
> > The pivot_root system calls takes two parameters:  new_root and put_old.
> > 
> > In this case the old root is put on put_old (which is the new_root).
> > And new_root is made the current root.
> > 
> > The pivot_root code looks everything up before it moves anything.   With
> > the result it is totally immaterrial which order the moves actually
> > happen in the code.  Further it is pretty much necessary to look
> > everything up before things are moved because the definition of paths
> > change.
> > 
> > So it would actually be difficult to have pivot_root(.,.) to do anything
> > except what it does today.
> > 
> > 
> >>> With "." and "/" pointing to new_root.
> >>>
> >>>>> umount2(".", MNT_DETACH)
> >>>
> >>> At this point resolving "." starts with new_root and follows up the
> >>> mount stack to old-root.
> >>
> >> Okay.
> >>
> >>> Ordinarily if you unmount "/" as is happening above you then need to
> >>> call chroot and possibly chdir to ensure neither "/" nor "." point to
> >>> somewhere other than the unmounted root filesystem.  In this specific
> >>> case because "/" and "." resolve to new_root under the filesystem that is
> >>> being unmounted that all is well.
> >>
> >> s/that/then/ ?
> 
> Thanks for the further clarifications.
> 
> All: I plan to add the following text to the manual page:
> 
>        new_root and put_old may be the same  directory.   In  particular,
>        the following sequence allows a pivot-root operation without need‐
>        ing to create and remove a temporary directory:
> 
>            chdir(new_root);
>            pivot_root(".", ".");
>            umount2(".", MNT_DETACH);

Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
before the umount2()? Especially for the container case... I think we
discussed this briefly yesterday in person.

> 
>        This sequence succeeds because the pivot_root()  call  stacks  the
>        old root mount point (old_root) on top of the new root mount point
>        at /.  At that point, the calling  process's  root  directory  and
>        current  working  directory  refer  to  the  new  root mount point
>        (new_root).  During the subsequent umount()  call,  resolution  of
>        "."   starts  with  new_root  and then moves up the list of mounts
>        stacked at /, with the result that old_root is unmounted.
> 
> Look okay?
> 
> Thanks,
> 
> Michael
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-10 11:15                 ` Christian Brauner
@ 2019-09-10 11:21                   ` Michael Kerrisk (man-pages)
  2019-09-10 23:06                     ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-10 11:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Eric W. Biederman, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Christian,

>> All: I plan to add the following text to the manual page:
>>
>>        new_root and put_old may be the same  directory.   In  particular,
>>        the following sequence allows a pivot-root operation without need‐
>>        ing to create and remove a temporary directory:
>>
>>            chdir(new_root);
>>            pivot_root(".", ".");
>>            umount2(".", MNT_DETACH);
> 
> Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
> before the umount2()? Especially for the container case... I think we
> discussed this briefly yesterday in person.
Thanks for noticing. That detail (more precisely: not MS_SHARED) is
already covered in the numerous other changes that I have pending
for this page:

       The following restrictions apply:
       ...
       -  The propagation type of new_root and its parent mount must  not
          be MS_SHARED; similarly, if put_old is an existing mount point,
          its propagation type must not be MS_SHARED.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-10 11:21                   ` Michael Kerrisk (man-pages)
@ 2019-09-10 23:06                     ` Eric W. Biederman
  2019-09-15  8:12                       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-09-10 23:06 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Christian,
>
>>> All: I plan to add the following text to the manual page:
>>>
>>>        new_root and put_old may be the same  directory.   In  particular,
>>>        the following sequence allows a pivot-root operation without need‐
>>>        ing to create and remove a temporary directory:
>>>
>>>            chdir(new_root);
>>>            pivot_root(".", ".");
>>>            umount2(".", MNT_DETACH);
>> 
>> Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
>> before the umount2()? Especially for the container case... I think we
>> discussed this briefly yesterday in person.
> Thanks for noticing. That detail (more precisely: not MS_SHARED) is
> already covered in the numerous other changes that I have pending
> for this page:
>
>        The following restrictions apply:
>        ...
>        -  The propagation type of new_root and its parent mount must  not
>           be MS_SHARED; similarly, if put_old is an existing mount point,
>           its propagation type must not be MS_SHARED.

Ugh.  That is close but not quite correct.

A better explanation:

    The pivot_root system call will never propagate any changes it makes.
    The pivot_root system call ensures this is safe by verifying that
    none of put_old, the parent of new_root, and parent of the root directory
    have a propagation type of MS_SHARED.

>

The concern from our conversation at the container mini-summit was that
there is a pathology if in your initial mount namespace all of the
mounts are marked MS_SHARED like systemd does (and is almost necessary
if you are going to use mount propagation), that if new_root itself
is MS_SHARED then unmounting the old_root could propagate.

So I believe the desired sequence is:

>>>            chdir(new_root);
+++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>            pivot_root(".", ".");
>>>            umount2(".", MNT_DETACH);

The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
long as it is not MS_SHARED the mount won't propagate back to the
parent mount namespace.

Eric



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-10 23:06                     ` Eric W. Biederman
@ 2019-09-15  8:12                       ` Michael Kerrisk (man-pages)
  2019-09-15 18:17                         ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-15  8:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Christian Brauner, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Eric,

On 9/11/19 1:06 AM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> 
>> Hello Christian,
>>
>>>> All: I plan to add the following text to the manual page:
>>>>
>>>>        new_root and put_old may be the same  directory.   In  particular,
>>>>        the following sequence allows a pivot-root operation without need‐
>>>>        ing to create and remove a temporary directory:
>>>>
>>>>            chdir(new_root);
>>>>            pivot_root(".", ".");
>>>>            umount2(".", MNT_DETACH);
>>>
>>> Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
>>> before the umount2()? Especially for the container case... I think we
>>> discussed this briefly yesterday in person.
>> Thanks for noticing. That detail (more precisely: not MS_SHARED) is
>> already covered in the numerous other changes that I have pending
>> for this page:
>>
>>        The following restrictions apply:
>>        ...
>>        -  The propagation type of new_root and its parent mount must  not
>>           be MS_SHARED; similarly, if put_old is an existing mount point,
>>           its propagation type must not be MS_SHARED.
> 
> Ugh.  That is close but not quite correct.
> 
> A better explanation:
> 
>     The pivot_root system call will never propagate any changes it makes.
>     The pivot_root system call ensures this is safe by verifying that
>     none of put_old, the parent of new_root, and parent of the root directory
>     have a propagation type of MS_SHARED.

Thanks for that. However, another question. You text has two changes.
First, I understand why you reword the discussion to indicate the
_purpose_ of the rules. However, you also, AFAICS, list a different set of
of directories that can't be MS_SHARED:

I said: new_root, the parent of new_root, and put_old
You said: the parent of new_root, and put_old, and parent of the
root directory.

Was I wrong on this detail also?

> The concern from our conversation at the container mini-summit was that
> there is a pathology if in your initial mount namespace all of the
> mounts are marked MS_SHARED like systemd does (and is almost necessary
> if you are going to use mount propagation), that if new_root itself
> is MS_SHARED then unmounting the old_root could propagate.
> 
> So I believe the desired sequence is:
> 
>>>>            chdir(new_root);
> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>            pivot_root(".", ".");
>>>>            umount2(".", MNT_DETACH);
> 
> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
> long as it is not MS_SHARED the mount won't propagate back to the
> parent mount namespace.

Thanks. I made that change.

Cheers,

Michael
 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-15  8:12                       ` Michael Kerrisk (man-pages)
@ 2019-09-15 18:17                         ` Eric W. Biederman
  2019-09-23 11:10                           ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-09-15 18:17 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Eric,
>
> On 9/11/19 1:06 AM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>> 
>>> Hello Christian,
>>>
>>>>> All: I plan to add the following text to the manual page:
>>>>>
>>>>>        new_root and put_old may be the same  directory.   In  particular,
>>>>>        the following sequence allows a pivot-root operation without need‐
>>>>>        ing to create and remove a temporary directory:
>>>>>
>>>>>            chdir(new_root);
>>>>>            pivot_root(".", ".");
>>>>>            umount2(".", MNT_DETACH);
>>>>
>>>> Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
>>>> before the umount2()? Especially for the container case... I think we
>>>> discussed this briefly yesterday in person.
>>> Thanks for noticing. That detail (more precisely: not MS_SHARED) is
>>> already covered in the numerous other changes that I have pending
>>> for this page:
>>>
>>>        The following restrictions apply:
>>>        ...
>>>        -  The propagation type of new_root and its parent mount must  not
>>>           be MS_SHARED; similarly, if put_old is an existing mount point,
>>>           its propagation type must not be MS_SHARED.
>> 
>> Ugh.  That is close but not quite correct.
>> 
>> A better explanation:
>> 
>>     The pivot_root system call will never propagate any changes it makes.
>>     The pivot_root system call ensures this is safe by verifying that
>>     none of put_old, the parent of new_root, and parent of the root directory
>>     have a propagation type of MS_SHARED.
>
> Thanks for that. However, another question. You text has two changes.
> First, I understand why you reword the discussion to indicate the
> _purpose_ of the rules. However, you also, AFAICS, list a different set of
> of directories that can't be MS_SHARED:
>
> I said: new_root, the parent of new_root, and put_old
> You said: the parent of new_root, and put_old, and parent of the
> root directory.


> Was I wrong on this detail also?

That is how I read the code.  The code says:

	if (IS_MNT_SHARED(old_mnt) ||
		IS_MNT_SHARED(new_mnt->mnt_parent) ||
		IS_MNT_SHARED(root_mnt->mnt_parent))
		goto out4;

We both agree on put_old and the parent of new_mnt.

When I look at the code root_mnt comes from the root directory, not new_mnt.

Furthermore those checks fundamentally makes sense as the root directory
and new_root that are moving.  The directory put_old simply has
something moving onto it.

>> The concern from our conversation at the container mini-summit was that
>> there is a pathology if in your initial mount namespace all of the
>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>> if you are going to use mount propagation), that if new_root itself
>> is MS_SHARED then unmounting the old_root could propagate.
>> 
>> So I believe the desired sequence is:
>> 
>>>>>            chdir(new_root);
>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>            pivot_root(".", ".");
>>>>>            umount2(".", MNT_DETACH);
>> 
>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>> long as it is not MS_SHARED the mount won't propagate back to the
>> parent mount namespace.
>
> Thanks. I made that change.

For what it is worth.  The sequence above without the change in mount
attributes will fail if it is necessary to change the mount attributes
as "." is both put_old as well as new_root.

When I initially suggested the change I saw "." was new_root and forgot
"." was also put_old.  So I thought there was a silent danger without
that sequence.

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-15 18:17                         ` Eric W. Biederman
@ 2019-09-23 11:10                           ` Michael Kerrisk (man-pages)
  2019-09-28 15:05                             ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-23 11:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Christian Brauner, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Eric,

On 9/15/19 8:17 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> 
>> Hello Eric,
>>
>> On 9/11/19 1:06 AM, Eric W. Biederman wrote:
>>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>>>
>>>> Hello Christian,
>>>>
>>>>>> All: I plan to add the following text to the manual page:
>>>>>>
>>>>>>        new_root and put_old may be the same  directory.   In  particular,
>>>>>>        the following sequence allows a pivot-root operation without need‐
>>>>>>        ing to create and remove a temporary directory:
>>>>>>
>>>>>>            chdir(new_root);
>>>>>>            pivot_root(".", ".");
>>>>>>            umount2(".", MNT_DETACH);
>>>>>
>>>>> Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
>>>>> before the umount2()? Especially for the container case... I think we
>>>>> discussed this briefly yesterday in person.
>>>> Thanks for noticing. That detail (more precisely: not MS_SHARED) is
>>>> already covered in the numerous other changes that I have pending
>>>> for this page:
>>>>
>>>>        The following restrictions apply:
>>>>        ...
>>>>        -  The propagation type of new_root and its parent mount must  not
>>>>           be MS_SHARED; similarly, if put_old is an existing mount point,
>>>>           its propagation type must not be MS_SHARED.
>>>
>>> Ugh.  That is close but not quite correct.
>>>
>>> A better explanation:
>>>
>>>     The pivot_root system call will never propagate any changes it makes.
>>>     The pivot_root system call ensures this is safe by verifying that
>>>     none of put_old, the parent of new_root, and parent of the root directory
>>>     have a propagation type of MS_SHARED.
>>
>> Thanks for that. However, another question. You text has two changes.
>> First, I understand why you reword the discussion to indicate the
>> _purpose_ of the rules. However, you also, AFAICS, list a different set of
>> of directories that can't be MS_SHARED:
>>
>> I said: new_root, the parent of new_root, and put_old
>> You said: the parent of new_root, and put_old, and parent of the
>> root directory.
> 
> 
>> Was I wrong on this detail also?
> 
> That is how I read the code.  The code says:
> 
> 	if (IS_MNT_SHARED(old_mnt) ||
> 		IS_MNT_SHARED(new_mnt->mnt_parent) ||
> 		IS_MNT_SHARED(root_mnt->mnt_parent))
> 		goto out4;
> 
> We both agree on put_old and the parent of new_mnt.
> 
> When I look at the code root_mnt comes from the root directory, not new_mnt.

Hmm -- I had checked the code when I wrote my text, but somehow
I misread things. Going back to recheck the code, you are obviously
correct. Thanks for catching that.

> Furthermore those checks fundamentally makes sense as the root directory
> and new_root that are moving.  The directory put_old simply has
> something moving onto it.
> 
>>> The concern from our conversation at the container mini-summit was that
>>> there is a pathology if in your initial mount namespace all of the
>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>> if you are going to use mount propagation), that if new_root itself
>>> is MS_SHARED then unmounting the old_root could propagate.
>>>
>>> So I believe the desired sequence is:
>>>
>>>>>>            chdir(new_root);
>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>            pivot_root(".", ".");
>>>>>>            umount2(".", MNT_DETACH);
>>>
>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>> long as it is not MS_SHARED the mount won't propagate back to the
>>> parent mount namespace.
>>
>> Thanks. I made that change.
> 
> For what it is worth.  The sequence above without the change in mount
> attributes will fail if it is necessary to change the mount attributes
> as "." is both put_old as well as new_root.
> 
> When I initially suggested the change I saw "." was new_root and forgot
> "." was also put_old.  So I thought there was a silent danger without
> that sequence.

So, now I am a little confused by the comments you added here. Do you
now mean that the 

mount("", ".", MS_SLAVE | MS_REC, NULL);

call is not actually necessary?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-23 11:10                           ` Michael Kerrisk (man-pages)
@ 2019-09-28 15:05                             ` Michael Kerrisk (man-pages)
  2019-09-30 11:42                               ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-09-28 15:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Christian Brauner, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Eric,

A ping on my question below. Could you take a look please?

Thanks,

Michael

>>>> The concern from our conversation at the container mini-summit was that
>>>> there is a pathology if in your initial mount namespace all of the
>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>> if you are going to use mount propagation), that if new_root itself
>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>
>>>> So I believe the desired sequence is:
>>>>
>>>>>>>            chdir(new_root);
>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>            pivot_root(".", ".");
>>>>>>>            umount2(".", MNT_DETACH);
>>>>
>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>> parent mount namespace.
>>>
>>> Thanks. I made that change.
>>
>> For what it is worth.  The sequence above without the change in mount
>> attributes will fail if it is necessary to change the mount attributes
>> as "." is both put_old as well as new_root.
>>
>> When I initially suggested the change I saw "." was new_root and forgot
>> "." was also put_old.  So I thought there was a silent danger without
>> that sequence.
> 
> So, now I am a little confused by the comments you added here. Do you
> now mean that the 
> 
> mount("", ".", MS_SLAVE | MS_REC, NULL);
> 
> call is not actually necessary?
> 
> Thanks,
> 
> Michael
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-28 15:05                             ` Michael Kerrisk (man-pages)
@ 2019-09-30 11:42                               ` Eric W. Biederman
  2019-10-07 11:02                                 ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-09-30 11:42 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Eric,
>
> A ping on my question below. Could you take a look please?
>
> Thanks,
>
> Michael
>
>>>>> The concern from our conversation at the container mini-summit was that
>>>>> there is a pathology if in your initial mount namespace all of the
>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>> if you are going to use mount propagation), that if new_root itself
>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>
>>>>> So I believe the desired sequence is:
>>>>>
>>>>>>>>            chdir(new_root);
>>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>            pivot_root(".", ".");
>>>>>>>>            umount2(".", MNT_DETACH);
>>>>>
>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>> parent mount namespace.
>>>>
>>>> Thanks. I made that change.
>>>
>>> For what it is worth.  The sequence above without the change in mount
>>> attributes will fail if it is necessary to change the mount attributes
>>> as "." is both put_old as well as new_root.
>>>
>>> When I initially suggested the change I saw "." was new_root and forgot
>>> "." was also put_old.  So I thought there was a silent danger without
>>> that sequence.
>> 
>> So, now I am a little confused by the comments you added here. Do you
>> now mean that the 
>> 
>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>> 
>> call is not actually necessary?

Apologies for being slow getting back to you.

To my knowledge there are two cases where pivot_root is used.
- In the initial mount namespace from a ramdisk when mounting root.
  This is the original use case and somewhat historical as rootfs
  (aka an initial ramfs) may not be unmounted.

- When setting up a new mount namespace to jettison all of the mounts
  you don't need.

The sequence:

	chdir(new_root);
        pivot_root(".", ".");
        umount2(".", MNT_DETACH);

is perfect for both use cases (as nothing needs to be known about the
directory layout of the new root filesystem).

In the case when you are setting up a new mount namespace propogating
changes in the mount layout to another mount namespace is fatal.  But
that is not a concern for using that pivot_root sequence above because
pivot_root will fail deterministically if
'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.

So I would document the above sequence of three system calls in the
man-page.

I would document that pivot_root will fail if propagation would occur.

I would document in pivot_root or under unshare(CLONE_NEWNS) that if
mount propagation is enabled (the default with systemd) that you
need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
namespace.  Or mounts will propagate backwards, which is usually
not what people want.

Creating of a mount namespace in a user namespace automatically does
'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
namespace was not created in that user namespace.  AKA creating
a mount namespace in a user namespace does the unshare for you.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-09-30 11:42                               ` Eric W. Biederman
@ 2019-10-07 11:02                                 ` Michael Kerrisk (man-pages)
  2019-10-07 15:46                                   ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-07 11:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Christian Brauner, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Eric,

On 9/30/19 2:42 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> 
>> Hello Eric,
>>
>> A ping on my question below. Could you take a look please?
>>
>> Thanks,
>>
>> Michael
>>
>>>>>> The concern from our conversation at the container mini-summit was that
>>>>>> there is a pathology if in your initial mount namespace all of the
>>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>>> if you are going to use mount propagation), that if new_root itself
>>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>>
>>>>>> So I believe the desired sequence is:
>>>>>>
>>>>>>>>>             chdir(new_root);
>>>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>>             pivot_root(".", ".");
>>>>>>>>>             umount2(".", MNT_DETACH);
>>>>>>
>>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>>> parent mount namespace.
>>>>>
>>>>> Thanks. I made that change.
>>>>
>>>> For what it is worth.  The sequence above without the change in mount
>>>> attributes will fail if it is necessary to change the mount attributes
>>>> as "." is both put_old as well as new_root.
>>>>
>>>> When I initially suggested the change I saw "." was new_root and forgot
>>>> "." was also put_old.  So I thought there was a silent danger without
>>>> that sequence.
>>>
>>> So, now I am a little confused by the comments you added here. Do you
>>> now mean that the
>>>
>>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>
>>> call is not actually necessary?
> 
> Apologies for being slow getting back to you.

NP. Thanks for your reply.

> To my knowledge there are two cases where pivot_root is used.
> - In the initial mount namespace from a ramdisk when mounting root.
>    This is the original use case and somewhat historical as rootfs
>    (aka an initial ramfs) may not be unmounted.
> 
> - When setting up a new mount namespace to jettison all of the mounts
>    you don't need.
> 
> The sequence:
> 
> 	chdir(new_root);
>          pivot_root(".", ".");
>          umount2(".", MNT_DETACH);
> 
> is perfect for both use cases (as nothing needs to be known about the
> directory layout of the new root filesystem).
> 
> In the case when you are setting up a new mount namespace propogating
> changes in the mount layout to another mount namespace is fatal.  But
> that is not a concern for using that pivot_root sequence above because
> pivot_root will fail deterministically if
> 'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.
> 
> So I would document the above sequence of three system calls in the
> man-page.

Okay. I've changed the example to be just those three calls.

> I would document that pivot_root will fail if propagation would occur.

Yep. That's in the page already.

> I would document in pivot_root or under unshare(CLONE_NEWNS) that if
> mount propagation is enabled (the default with systemd) that you
> need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
> 'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
> namespace.  Or mounts will propagate backwards, which is usually
> not what people want.

Thanks. Instead, I have added the following text to
mount_namespaces(7), the page that is referred to by both clone(2) and 
unshare(2) in their discussions of CLONE_NEWNS:

        An   application  that  creates  a  new  mount  namespace
        directly using clone(2) or unshare(2) may desire to  pre‐
        vent  propagation  of  mount events to other mount names‐
        paces (as is is done by unshare(1)).  This can be done by
        changing  the propagation type of mount points in the new
        namesapace to either MS_SLAVE  or  MS_PRIVATE.   using  a
        call such as the following:

                   mount(NULL, "/", MS_SLAVE | MS_REC, NULL);

> Creating of a mount namespace in a user namespace automatically does
> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
> namespace was not created in that user namespace.  AKA creating
> a mount namespace in a user namespace does the unshare for you.

Oh -- I had forgotten that detail. But it is documented
(by you, I think) in mount_namespaces(7):

        *  A  mount  namespace  has  an  owner user namespace.  A
           mount namespace whose owner user namespace is  differ‐
           ent  from the owner user namespace of its parent mount
           namespace is considered a less privileged mount names‐
           pace.

        *  When  creating  a  less  privileged  mount  namespace,
           shared mounts are reduced to  slave  mounts.   (Shared
           and  slave  mounts are discussed below.)  This ensures
           that  mappings  performed  in  less  privileged  mount
           namespaces will not propagate to more privileged mount
           namespaces.

There's one point that description that troubles me. There is a
reference to "parent mount namespace", but as I understand things
there is no parental relationship among mount namespaces instances
(or am I wrong?). Should that wording not be rather something
like "the mount namespace of the process that created this mount
namespace"?

Thanks,

Michael



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-10-07 11:02                                 ` Michael Kerrisk (man-pages)
@ 2019-10-07 15:46                                   ` Eric W. Biederman
  2019-10-08 14:27                                     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-10-07 15:46 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Eric,
>
> On 9/30/19 2:42 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>>
>>> Hello Eric,
>>>
>>> A ping on my question below. Could you take a look please?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>>>>> The concern from our conversation at the container mini-summit was that
>>>>>>> there is a pathology if in your initial mount namespace all of the
>>>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>>>> if you are going to use mount propagation), that if new_root itself
>>>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>>>
>>>>>>> So I believe the desired sequence is:
>>>>>>>
>>>>>>>>>>             chdir(new_root);
>>>>>>> +++            mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>>>             pivot_root(".", ".");
>>>>>>>>>>             umount2(".", MNT_DETACH);
>>>>>>>
>>>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE.  So
>>>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>>>> parent mount namespace.
>>>>>>
>>>>>> Thanks. I made that change.
>>>>>
>>>>> For what it is worth.  The sequence above without the change in mount
>>>>> attributes will fail if it is necessary to change the mount attributes
>>>>> as "." is both put_old as well as new_root.
>>>>>
>>>>> When I initially suggested the change I saw "." was new_root and forgot
>>>>> "." was also put_old.  So I thought there was a silent danger without
>>>>> that sequence.
>>>>
>>>> So, now I am a little confused by the comments you added here. Do you
>>>> now mean that the
>>>>
>>>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>
>>>> call is not actually necessary?
>>
>> Apologies for being slow getting back to you.
>
> NP. Thanks for your reply.
>
>> To my knowledge there are two cases where pivot_root is used.
>> - In the initial mount namespace from a ramdisk when mounting root.
>>    This is the original use case and somewhat historical as rootfs
>>    (aka an initial ramfs) may not be unmounted.
>>
>> - When setting up a new mount namespace to jettison all of the mounts
>>    you don't need.
>>
>> The sequence:
>>
>> 	chdir(new_root);
>>          pivot_root(".", ".");
>>          umount2(".", MNT_DETACH);
>>
>> is perfect for both use cases (as nothing needs to be known about the
>> directory layout of the new root filesystem).
>>
>> In the case when you are setting up a new mount namespace propogating
>> changes in the mount layout to another mount namespace is fatal.  But
>> that is not a concern for using that pivot_root sequence above because
>> pivot_root will fail deterministically if
>> 'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.
>>
>> So I would document the above sequence of three system calls in the
>> man-page.
>
> Okay. I've changed the example to be just those three calls.
>
>> I would document that pivot_root will fail if propagation would occur.
>
> Yep. That's in the page already.
>
>> I would document in pivot_root or under unshare(CLONE_NEWNS) that if
>> mount propagation is enabled (the default with systemd) that you
>> need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
>> 'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
>> namespace.  Or mounts will propagate backwards, which is usually
>> not what people want.
>
> Thanks. Instead, I have added the following text to
> mount_namespaces(7), the page that is referred to by both clone(2) and
> unshare(2) in their discussions of CLONE_NEWNS:
>
>        An   application  that  creates  a  new  mount  namespace
>        directly using clone(2) or unshare(2) may desire to  pre‐
>        vent  propagation  of  mount events to other mount names‐
>        paces (as is is done by unshare(1)).  This can be done by
>        changing  the propagation type of mount points in the new
>        namesapace to either MS_SLAVE  or  MS_PRIVATE.   using  a
>        call such as the following:
>
>                   mount(NULL, "/", MS_SLAVE | MS_REC, NULL);

Yes.

>> Creating of a mount namespace in a user namespace automatically does
>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>> namespace was not created in that user namespace.  AKA creating
>> a mount namespace in a user namespace does the unshare for you.
>
> Oh -- I had forgotten that detail. But it is documented
> (by you, I think) in mount_namespaces(7):
>
>        *  A  mount  namespace  has  an  owner user namespace.  A
>           mount namespace whose owner user namespace is  differ‐
>           ent  from the owner user namespace of its parent mount
>           namespace is considered a less privileged mount names‐
>           pace.
>
>        *  When  creating  a  less  privileged  mount  namespace,
>           shared mounts are reduced to  slave  mounts.   (Shared
>           and  slave  mounts are discussed below.)  This ensures
>           that  mappings  performed  in  less  privileged  mount
>           namespaces will not propagate to more privileged mount
>           namespaces.
>
> There's one point that description that troubles me. There is a
> reference to "parent mount namespace", but as I understand things
> there is no parental relationship among mount namespaces instances
> (or am I wrong?). Should that wording not be rather something
> like "the mount namespace of the process that created this mount
> namespace"?

How about "the mount namespace this mount namespace started as a copy of"

You are absolutely correct there is no relationship between mount
namespaces.  There is just the propagation tree between mounts.  (Which
acts similarly to a parent/child relationship but is not at all the same
thing).

Eric


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-10-07 15:46                                   ` Eric W. Biederman
@ 2019-10-08 14:27                                     ` Michael Kerrisk (man-pages)
  2019-10-08 19:40                                       ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-08 14:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Christian Brauner, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

Hello Eric,

>>> Creating of a mount namespace in a user namespace automatically does
>>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>>> namespace was not created in that user namespace.  AKA creating
>>> a mount namespace in a user namespace does the unshare for you.
>>
>> Oh -- I had forgotten that detail. But it is documented
>> (by you, I think) in mount_namespaces(7):
>>
>>        *  A  mount  namespace  has  an  owner user namespace.  A
>>           mount namespace whose owner user namespace is  differ‐
>>           ent  from the owner user namespace of its parent mount
>>           namespace is considered a less privileged mount names‐
>>           pace.
>>
>>        *  When  creating  a  less  privileged  mount  namespace,
>>           shared mounts are reduced to  slave  mounts.   (Shared
>>           and  slave  mounts are discussed below.)  This ensures
>>           that  mappings  performed  in  less  privileged  mount
>>           namespaces will not propagate to more privileged mount
>>           namespaces.
>>
>> There's one point that description that troubles me. There is a
>> reference to "parent mount namespace", but as I understand things
>> there is no parental relationship among mount namespaces instances
>> (or am I wrong?). Should that wording not be rather something
>> like "the mount namespace of the process that created this mount
>> namespace"?
> 
> How about "the mount namespace this mount namespace started as a copy of"
> 
> You are absolutely correct there is no relationship between mount
> namespaces.  There is just the propagation tree between mounts.  (Which
> acts similarly to a parent/child relationship but is not at all the same
> thing).

Thanks. I made the text as follows:

       *  Each  mount  namespace  has  an owner user namespace.  As noted
          above, when a new mount namespace is  created,  it  inherits  a
          copy  of  the  mount  points  from  the  mount namespace of the
          process that created the new mount namespace.  If the two mount
          namespaces are owned by different user namespaces, then the new
          mount namespace is considered less privileged.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-10-08 14:27                                     ` Michael Kerrisk (man-pages)
@ 2019-10-08 19:40                                       ` Eric W. Biederman
  2019-10-08 21:40                                         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2019-10-08 19:40 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Eric,
>
>>>> Creating of a mount namespace in a user namespace automatically does
>>>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>>>> namespace was not created in that user namespace.  AKA creating
>>>> a mount namespace in a user namespace does the unshare for you.
>>>
>>> Oh -- I had forgotten that detail. But it is documented
>>> (by you, I think) in mount_namespaces(7):
>>>
>>>        *  A  mount  namespace  has  an  owner user namespace.  A
>>>           mount namespace whose owner user namespace is  differ‐
>>>           ent  from the owner user namespace of its parent mount
>>>           namespace is considered a less privileged mount names‐
>>>           pace.
>>>
>>>        *  When  creating  a  less  privileged  mount  namespace,
>>>           shared mounts are reduced to  slave  mounts.   (Shared
>>>           and  slave  mounts are discussed below.)  This ensures
>>>           that  mappings  performed  in  less  privileged  mount
>>>           namespaces will not propagate to more privileged mount
>>>           namespaces.
>>>
>>> There's one point that description that troubles me. There is a
>>> reference to "parent mount namespace", but as I understand things
>>> there is no parental relationship among mount namespaces instances
>>> (or am I wrong?). Should that wording not be rather something
>>> like "the mount namespace of the process that created this mount
>>> namespace"?
>> 
>> How about "the mount namespace this mount namespace started as a copy of"
>> 
>> You are absolutely correct there is no relationship between mount
>> namespaces.  There is just the propagation tree between mounts.  (Which
>> acts similarly to a parent/child relationship but is not at all the same
>> thing).
>
> Thanks. I made the text as follows:
>
>        *  Each  mount  namespace  has  an owner user namespace.  As noted
>           above, when a new mount namespace is  created,  it  inherits  a
>           copy  of  the  mount  points  from  the  mount namespace of the
>           process that created the new mount namespace.  If the two mount
>           namespaces are owned by different user namespaces, then the new
>           mount namespace is considered less privileged.

I hate to nitpick, but I am going to say that when I read the text above
the phrase "mount namespace of the process that created the new mount
namespace" feels wrong.

Either you use unshare(2) and the mount namespace of the process that
created the mount namespace changes.

Or you use clone(2) and you could argue it is the new child that created
the mount namespace.

Having a different mount namespace at the end of the creation operation
feels like it makes your phrase confusing about what the starting
mount namespace is.  I hate to use references that are ambiguous when
things are changing.

I agree that the term parent is also wrong.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-10-08 19:40                                       ` Eric W. Biederman
@ 2019-10-08 21:40                                         ` Michael Kerrisk (man-pages)
  2019-10-08 22:16                                           ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-10-08 21:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Christian Brauner, linux-man, Containers, lkml,
	Andy Lutomirski, Jordan Ogas, werner, Al Viro

On 10/8/19 9:40 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> 
>> Hello Eric,
>>
>>>>> Creating of a mount namespace in a user namespace automatically does
>>>>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>>>>> namespace was not created in that user namespace.  AKA creating
>>>>> a mount namespace in a user namespace does the unshare for you.
>>>>
>>>> Oh -- I had forgotten that detail. But it is documented
>>>> (by you, I think) in mount_namespaces(7):
>>>>
>>>>        *  A  mount  namespace  has  an  owner user namespace.  A
>>>>           mount namespace whose owner user namespace is  differ‐
>>>>           ent  from the owner user namespace of its parent mount
>>>>           namespace is considered a less privileged mount names‐
>>>>           pace.
>>>>
>>>>        *  When  creating  a  less  privileged  mount  namespace,
>>>>           shared mounts are reduced to  slave  mounts.   (Shared
>>>>           and  slave  mounts are discussed below.)  This ensures
>>>>           that  mappings  performed  in  less  privileged  mount
>>>>           namespaces will not propagate to more privileged mount
>>>>           namespaces.
>>>>
>>>> There's one point that description that troubles me. There is a
>>>> reference to "parent mount namespace", but as I understand things
>>>> there is no parental relationship among mount namespaces instances
>>>> (or am I wrong?). Should that wording not be rather something
>>>> like "the mount namespace of the process that created this mount
>>>> namespace"?
>>>
>>> How about "the mount namespace this mount namespace started as a copy of"
>>>
>>> You are absolutely correct there is no relationship between mount
>>> namespaces.  There is just the propagation tree between mounts.  (Which
>>> acts similarly to a parent/child relationship but is not at all the same
>>> thing).
>>
>> Thanks. I made the text as follows:
>>
>>        *  Each  mount  namespace  has  an owner user namespace.  As noted
>>           above, when a new mount namespace is  created,  it  inherits  a
>>           copy  of  the  mount  points  from  the  mount namespace of the
>>           process that created the new mount namespace.  If the two mount
>>           namespaces are owned by different user namespaces, then the new
>>           mount namespace is considered less privileged.
> 
> I hate to nitpick, 

I love it when you nitpick. Thanks for your attention to the details 
of my wording.

> but I am going to say that when I read the text above
> the phrase "mount namespace of the process that created the new mount
> namespace" feels wrong.
> 
> Either you use unshare(2) and the mount namespace of the process that
> created the mount namespace changes.
> 
> Or you use clone(2) and you could argue it is the new child that created
> the mount namespace.
> 
> Having a different mount namespace at the end of the creation operation
> feels like it makes your phrase confusing about what the starting
> mount namespace is.  I hate to use references that are ambiguous when
> things are changing.
>
> I agree that the term parent is also wrong.

I see what you mean. My wording is imprecise.

So, I tweaked text earlier in the page so that it now reads
as follows:

       A  new  mount  namespace  is  created  using  either  clone(2)  or
       unshare(2) with the CLONE_NEWNS flag.  When a new mount  namespace
       is created, its mount point list is initialized as follows:

       *  If  the  namespace  is  created using clone(2), the mount point
          list of the child's namespace is a copy of the mount point list
          in the parent's namespace.

       *  If  the  namespace is created using unshare(2), the mount point
          list of the new namespace is a copy of the mount point list  in
          the caller's previous mount namespace.

And then I tweaked the text that we are currently discussing to read:

       *  Each mount namespace has an owner user namespace.  As explained
          above,  when  a new mount namespace is created, its mount point
          list is initialized as a  copy  of  the  mount  point  list  of
          another  mount namespace.  If the new namespaces and the names‐
          pace from which the mount point list was copied  are  owned  by
          different user namespaces, then the new mount namespace is con‐
          sidered less privileged.

How does this look to you now?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: pivot_root(".", ".") and the fchdir() dance
  2019-10-08 21:40                                         ` Michael Kerrisk (man-pages)
@ 2019-10-08 22:16                                           ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2019-10-08 22:16 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, linux-man, Containers, lkml, Andy Lutomirski,
	Jordan Ogas, werner, Al Viro

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> On 10/8/19 9:40 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>> 
>>> Hello Eric,
>>>
>>>>>> Creating of a mount namespace in a user namespace automatically does
>>>>>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>>>>>> namespace was not created in that user namespace.  AKA creating
>>>>>> a mount namespace in a user namespace does the unshare for you.
>>>>>
>>>>> Oh -- I had forgotten that detail. But it is documented
>>>>> (by you, I think) in mount_namespaces(7):
>>>>>
>>>>>        *  A  mount  namespace  has  an  owner user namespace.  A
>>>>>           mount namespace whose owner user namespace is  differ‐
>>>>>           ent  from the owner user namespace of its parent mount
>>>>>           namespace is considered a less privileged mount names‐
>>>>>           pace.
>>>>>
>>>>>        *  When  creating  a  less  privileged  mount  namespace,
>>>>>           shared mounts are reduced to  slave  mounts.   (Shared
>>>>>           and  slave  mounts are discussed below.)  This ensures
>>>>>           that  mappings  performed  in  less  privileged  mount
>>>>>           namespaces will not propagate to more privileged mount
>>>>>           namespaces.
>>>>>
>>>>> There's one point that description that troubles me. There is a
>>>>> reference to "parent mount namespace", but as I understand things
>>>>> there is no parental relationship among mount namespaces instances
>>>>> (or am I wrong?). Should that wording not be rather something
>>>>> like "the mount namespace of the process that created this mount
>>>>> namespace"?
>>>>
>>>> How about "the mount namespace this mount namespace started as a copy of"
>>>>
>>>> You are absolutely correct there is no relationship between mount
>>>> namespaces.  There is just the propagation tree between mounts.  (Which
>>>> acts similarly to a parent/child relationship but is not at all the same
>>>> thing).
>>>
>>> Thanks. I made the text as follows:
>>>
>>>        *  Each  mount  namespace  has  an owner user namespace.  As noted
>>>           above, when a new mount namespace is  created,  it  inherits  a
>>>           copy  of  the  mount  points  from  the  mount namespace of the
>>>           process that created the new mount namespace.  If the two mount
>>>           namespaces are owned by different user namespaces, then the new
>>>           mount namespace is considered less privileged.
>> 
>> I hate to nitpick, 
>
> I love it when you nitpick. Thanks for your attention to the details 
> of my wording.
>
>> but I am going to say that when I read the text above
>> the phrase "mount namespace of the process that created the new mount
>> namespace" feels wrong.
>> 
>> Either you use unshare(2) and the mount namespace of the process that
>> created the mount namespace changes.
>> 
>> Or you use clone(2) and you could argue it is the new child that created
>> the mount namespace.
>> 
>> Having a different mount namespace at the end of the creation operation
>> feels like it makes your phrase confusing about what the starting
>> mount namespace is.  I hate to use references that are ambiguous when
>> things are changing.
>>
>> I agree that the term parent is also wrong.
>
> I see what you mean. My wording is imprecise.
>
> So, I tweaked text earlier in the page so that it now reads
> as follows:
>
>        A  new  mount  namespace  is  created  using  either  clone(2)  or
>        unshare(2) with the CLONE_NEWNS flag.  When a new mount  namespace
>        is created, its mount point list is initialized as follows:
>
>        *  If  the  namespace  is  created using clone(2), the mount point
>           list of the child's namespace is a copy of the mount point list
>           in the parent's namespace.
>
>        *  If  the  namespace is created using unshare(2), the mount point
>           list of the new namespace is a copy of the mount point list  in
>           the caller's previous mount namespace.
>
> And then I tweaked the text that we are currently discussing to read:
>
>        *  Each mount namespace has an owner user namespace.  As explained
>           above,  when  a new mount namespace is created, its mount point
>           list is initialized as a  copy  of  the  mount  point  list  of
>           another  mount namespace.  If the new namespaces and the names‐
>           pace from which the mount point list was copied  are  owned  by
>           different user namespaces, then the new mount namespace is con‐
>           sidered less privileged.
>
> How does this look to you now?

Much better thank you.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-10-08 22:17 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-01 13:38 pivot_root(".", ".") and the fchdir() dance Michael Kerrisk (man-pages)
2019-08-05 10:36 ` Aleksa Sarai
2019-08-05 12:29   ` Michael Kerrisk (man-pages)
2019-08-05 13:37     ` Aleksa Sarai
2019-08-06 19:35       ` Michael Kerrisk (man-pages)
2019-08-06  8:12     ` Philipp Wendler
2019-08-06 12:03       ` Michael Kerrisk (man-pages)
2019-09-09 10:40         ` Eric W. Biederman
2019-09-09 14:48           ` Michael Kerrisk (man-pages)
2019-09-09 23:40             ` Eric W. Biederman
2019-09-10 10:27               ` Michael Kerrisk (man-pages)
2019-09-10 11:15                 ` Christian Brauner
2019-09-10 11:21                   ` Michael Kerrisk (man-pages)
2019-09-10 23:06                     ` Eric W. Biederman
2019-09-15  8:12                       ` Michael Kerrisk (man-pages)
2019-09-15 18:17                         ` Eric W. Biederman
2019-09-23 11:10                           ` Michael Kerrisk (man-pages)
2019-09-28 15:05                             ` Michael Kerrisk (man-pages)
2019-09-30 11:42                               ` Eric W. Biederman
2019-10-07 11:02                                 ` Michael Kerrisk (man-pages)
2019-10-07 15:46                                   ` Eric W. Biederman
2019-10-08 14:27                                     ` Michael Kerrisk (man-pages)
2019-10-08 19:40                                       ` Eric W. Biederman
2019-10-08 21:40                                         ` Michael Kerrisk (man-pages)
2019-10-08 22:16                                           ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).